Time Series Analysis

Introduction
The objective of time series analysis is to uncover a pattern in a time series and then extrapolate the pattern into the future. Being able to forecast the future is the essence of time series analysis.

The forecast is based solely on past values of the variable and/or on past forecast errors.

Why forecast?
Forecasting applies to many business situations: forecasting demand with a view to making capacity build-out decisions, staff scheduling in a call center, understanding the demand for credit, determining the inventory to order in anticipation of demand, etc. Forecast timescales may differ based on needs: some situations require forecasting years ahead, while others may require forecasts for the next day, or even the next minute.

What is a time series?
A time series is a sequence of observations on a variable measured at successive points in time or over successive periods of time.

The measurements may be taken every hour, day, week, month, year, or any other regular interval. The pattern of the data is important in understanding the series’ past behavior.

If the behavior of the time series data of the past is expected to continue in the future, it can be used as a guide in selecting an appropriate forecasting method.

Let us look at an example.

As usual, some library imports first

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (20, 9)

Loading the data

Let us load some data: https://data.seattle.gov/Transportation/Fremont-Bridge-Bicycle-Counter/65db-xm6k.

[Image: the Fremont Bridge]

This is a picture of the bridge. The second picture shows the bicycle counter.

[Image: photo retrieved from a Google search, credit: Jason H, 2020]

[Image: photo retrieved from http://www.sheridestoday.com/blog/2015/12/21/fremont-bridge-bike-counter]

# Load the data
# You can get more information on this dataset at 
# https://data.seattle.gov/Transportation/Fremont-Bridge-Bicycle-Counter/65db-xm6k

df = pd.read_csv('https://data.seattle.gov/api/views/65db-xm6k/rows.csv')
# Review the column names
df.columns
Index(['Date', 'Fremont Bridge Sidewalks, south of N 34th St',
       'Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk',
       'Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk'],
      dtype='object')
df
Date Fremont Bridge Sidewalks, south of N 34th St Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk
0 08/01/2022 12:00:00 AM 23.0 7.0 16.0
1 08/01/2022 01:00:00 AM 12.0 5.0 7.0
2 08/01/2022 02:00:00 AM 3.0 0.0 3.0
3 08/01/2022 03:00:00 AM 5.0 2.0 3.0
4 08/01/2022 04:00:00 AM 10.0 2.0 8.0
... ... ... ... ...
95635 08/31/2023 07:00:00 PM 224.0 72.0 152.0
95636 08/31/2023 08:00:00 PM 142.0 59.0 83.0
95637 08/31/2023 09:00:00 PM 67.0 35.0 32.0
95638 08/31/2023 10:00:00 PM 43.0 18.0 25.0
95639 08/31/2023 11:00:00 PM 12.0 8.0 4.0

95640 rows × 4 columns

# df.to_excel('Bridge_crossing_data_07Nov2023.xlsx')

We have hourly data on bicycle crossings with three count columns, where Total = East + West sidewalk counts. Our data runs from October 2012 through August 2023, totaling 95k+ rows.

For doing time series analysis with Pandas, the data frame's index should be set to the datetime of each row.

For convenience, we also rename the columns to ['Total', 'East', 'West'].

# Set the index of the time series
df.index = pd.DatetimeIndex(df.Date)
# Now drop the Date column as it is a part of the index
df.drop(columns='Date', inplace=True)
df.head()
Fremont Bridge Sidewalks, south of N 34th St Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk
Date
2022-08-01 00:00:00 23.0 7.0 16.0
2022-08-01 01:00:00 12.0 5.0 7.0
2022-08-01 02:00:00 3.0 0.0 3.0
2022-08-01 03:00:00 5.0 2.0 3.0
2022-08-01 04:00:00 10.0 2.0 8.0
# Rename the columns to make them simpler to use
df.columns = ['Total', 'East', 'West']

Data Exploration

df.shape
(95640, 3)
# Check the maximum and the minimum dates in our data
print(df.index.max())
print(df.index.min())
2023-08-31 23:00:00
2012-10-03 00:00:00
# Let us drop NaN values
df.dropna(inplace=True)
df.shape
(95614, 3)
# Let us look at some sample rows
df.head(10)
Total East West
Date
2022-08-01 00:00:00 23.0 7.0 16.0
2022-08-01 01:00:00 12.0 5.0 7.0
2022-08-01 02:00:00 3.0 0.0 3.0
2022-08-01 03:00:00 5.0 2.0 3.0
2022-08-01 04:00:00 10.0 2.0 8.0
2022-08-01 05:00:00 27.0 5.0 22.0
2022-08-01 06:00:00 100.0 43.0 57.0
2022-08-01 07:00:00 219.0 90.0 129.0
2022-08-01 08:00:00 335.0 143.0 192.0
2022-08-01 09:00:00 212.0 85.0 127.0
# We plot the data 
# Pandas knows that this is a time-series, and creates the right plot

df.plot(kind = 'line',figsize=(12,6));

png

# Let us look at just the first 200 data points

title='Bicycle Crossings'
ylabel='Count'
xlabel='Date'


ax = df.iloc[:200,:].plot(figsize=(18,6),title=title)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);

png

Resampling

resample() is a time-based groupby. Pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.

The resample function is very flexible and allows you to specify many different parameters to control the frequency conversion and resampling operation.

Many functions are available as a method of the returned object, including sum, mean, std, sem, max, min, median, first, last, ohlc.

Alias Description
B business day frequency
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
MS month start frequency
Q quarter end frequency
A, Y year end frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
N nanoseconds
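For instance (a quick sketch reusing the hourly df loaded above), we can combine a couple of these aliases with different aggregation methods:

# A couple of quick resample() examples on the hourly bicycle data:
# weekly average hourly count, and total crossings per day
weekly_mean = df.resample('W').mean()
daily_total = df['Total'].resample('D').sum()
daily_total.head()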
# Let us resample the data to be monthly

df.resample(rule='M').sum().plot(figsize = (18,6));

png

# Let us examine monthly data
# We create a new monthly dataframe

df_monthly = df.resample(rule='M').sum()
# Just to keep our analysis clean and be able to understand concepts,
# we will limit ourselves to pre-Covid data

df_precovid = df_monthly[df_monthly.index < pd.to_datetime('2019-12-31')]
df_precovid.plot(figsize = (18,6));

png

# We suppress some warnings pandas produces, more for 
# visual cleanliness than any other reason

pd.options.mode.chained_assignment = None 
df_monthly
Total East West
Date
2012-10-31 65695.0 33764.0 31931.0
2012-11-30 50647.0 26062.0 24585.0
2012-12-31 36369.0 18608.0 17761.0
2013-01-31 44884.0 22910.0 21974.0
2013-02-28 50027.0 25898.0 24129.0
... ... ... ...
2023-04-30 60494.0 23784.0 36710.0
2023-05-31 105039.0 40303.0 64736.0
2023-06-30 102158.0 38076.0 64082.0
2023-07-31 112791.0 43064.0 69727.0
2023-08-31 108541.0 38967.0 69574.0

131 rows × 3 columns

Filtering time series

Source: https://pandas.pydata.org/docs/user_guide/timeseries.html#indexing

Using the index for time series provides us the advantage of being able to filter easily by year or month.

For example, you can do df.loc['2017'] to list all observations for 2017, df.loc['2017-02'] for a single month, or df.loc['2017-02-15'] for a single day.

df.loc['2017-02-15']
Total East West
Date
2017-02-15 00:00:00 4.0 3.0 1.0
2017-02-15 01:00:00 3.0 1.0 2.0
2017-02-15 02:00:00 0.0 0.0 0.0
2017-02-15 03:00:00 2.0 0.0 2.0
2017-02-15 04:00:00 2.0 1.0 1.0
2017-02-15 05:00:00 18.0 8.0 10.0
2017-02-15 06:00:00 60.0 45.0 15.0
2017-02-15 07:00:00 188.0 117.0 71.0
2017-02-15 08:00:00 262.0 152.0 110.0
2017-02-15 09:00:00 147.0 68.0 79.0
2017-02-15 10:00:00 49.0 25.0 24.0
2017-02-15 11:00:00 23.0 13.0 10.0
2017-02-15 12:00:00 12.0 7.0 5.0
2017-02-15 13:00:00 22.0 9.0 13.0
2017-02-15 14:00:00 17.0 2.0 15.0
2017-02-15 15:00:00 47.0 22.0 25.0
2017-02-15 16:00:00 99.0 29.0 70.0
2017-02-15 17:00:00 272.0 54.0 218.0
2017-02-15 18:00:00 181.0 48.0 133.0
2017-02-15 19:00:00 76.0 16.0 60.0
2017-02-15 20:00:00 43.0 14.0 29.0
2017-02-15 21:00:00 15.0 5.0 10.0
2017-02-15 22:00:00 16.0 6.0 10.0
2017-02-15 23:00:00 3.0 1.0 2.0
df.loc['2018']
Total East West
Date
2018-01-01 00:00:00 28.0 14.0 14.0
2018-01-01 01:00:00 16.0 2.0 14.0
2018-01-01 02:00:00 8.0 4.0 4.0
2018-01-01 03:00:00 2.0 2.0 0.0
2018-01-01 04:00:00 0.0 0.0 0.0
... ... ... ...
2018-12-31 19:00:00 14.0 9.0 5.0
2018-12-31 20:00:00 26.0 12.0 14.0
2018-12-31 21:00:00 14.0 7.0 7.0
2018-12-31 22:00:00 7.0 3.0 4.0
2018-12-31 23:00:00 13.0 7.0 6.0

8759 rows × 3 columns

df.loc['2018-02']
Total East West
Date
2018-02-01 00:00:00 8.0 2.0 6.0
2018-02-01 01:00:00 3.0 2.0 1.0
2018-02-01 02:00:00 0.0 0.0 0.0
2018-02-01 03:00:00 6.0 3.0 3.0
2018-02-01 04:00:00 8.0 5.0 3.0
... ... ... ...
2018-02-28 19:00:00 77.0 17.0 60.0
2018-02-28 20:00:00 35.0 7.0 28.0
2018-02-28 21:00:00 32.0 14.0 18.0
2018-02-28 22:00:00 13.0 2.0 11.0
2018-02-28 23:00:00 9.0 3.0 6.0

672 rows × 3 columns

You can also use regular boolean comparisons to filter date ranges:

df[(df.index > pd.to_datetime('1/31/2020')) & (df.index < pd.to_datetime('1/1/2022'))]
Total East West
Date
2020-01-31 01:00:00 1.0 0.0 1.0
2020-01-31 02:00:00 0.0 0.0 0.0
2020-01-31 03:00:00 0.0 0.0 0.0
2020-01-31 04:00:00 8.0 6.0 2.0
2020-01-31 05:00:00 14.0 7.0 7.0
... ... ... ...
2021-12-31 19:00:00 0.0 0.0 0.0
2021-12-31 20:00:00 0.0 0.0 0.0
2021-12-31 21:00:00 0.0 0.0 0.0
2021-12-31 22:00:00 0.0 0.0 0.0
2021-12-31 23:00:00 0.0 0.0 0.0

16821 rows × 3 columns

Sometimes, the date may be contained in a column.

In such cases, we filter as follows:

# We create a temporary dataframe to illustrate
temporary_df = df.loc['2017-01'].copy()
temporary_df.reset_index(inplace = True)
temporary_df
Date Total East West
0 2017-01-01 00:00:00 5.0 0.0 5.0
1 2017-01-01 01:00:00 19.0 5.0 14.0
2 2017-01-01 02:00:00 1.0 1.0 0.0
3 2017-01-01 03:00:00 2.0 0.0 2.0
4 2017-01-01 04:00:00 1.0 0.0 1.0
... ... ... ... ...
739 2017-01-31 19:00:00 116.0 27.0 89.0
740 2017-01-31 20:00:00 64.0 25.0 39.0
741 2017-01-31 21:00:00 32.0 19.0 13.0
742 2017-01-31 22:00:00 19.0 4.0 15.0
743 2017-01-31 23:00:00 15.0 6.0 9.0

744 rows × 4 columns

temporary_df['Date'].dt.day==2
0      False
1      False
2      False
3      False
4      False
       ...  
739    False
740    False
741    False
742    False
743    False
Name: Date, Length: 744, dtype: bool
temporary_df[temporary_df['Date'].dt.month == 1]

# or use temporary_df['Date'].dt.day and year as well
Date Total East West
0 2017-01-01 00:00:00 5.0 0.0 5.0
1 2017-01-01 01:00:00 19.0 5.0 14.0
2 2017-01-01 02:00:00 1.0 1.0 0.0
3 2017-01-01 03:00:00 2.0 0.0 2.0
4 2017-01-01 04:00:00 1.0 0.0 1.0
... ... ... ... ...
739 2017-01-31 19:00:00 116.0 27.0 89.0
740 2017-01-31 20:00:00 64.0 25.0 39.0
741 2017-01-31 21:00:00 32.0 19.0 13.0
742 2017-01-31 22:00:00 19.0 4.0 15.0
743 2017-01-31 23:00:00 15.0 6.0 9.0

744 rows × 4 columns

Plot by month and quarter

from statsmodels.graphics.tsaplots import month_plot, quarter_plot
# Plot the months to see trends over months
month_plot(df_precovid.Total);

png

# Plot the quarter to see trends over quarters
quarter_plot(df_precovid.resample(rule='Q').Total.sum());

png

ETS Decomposition

When we decompose a time series, we are essentially expressing a belief that our data has several discrete components to it, each of which can be isolated and studied separately.

Generally, time series data is split into 3 components: error, trend and seasonality (hence ‘ETS Decomposition’):
1. Seasonal component
2. Trend/cycle component
3. Residual, or error component, which is not explained by the above two.

Multiplicative vs Additive Decomposition
If we assume an additive decomposition, then we can write:

$$y_t = S_t + T_t + R_t$$

where $y_t$ is the data,
- $S_t$ is the seasonal component,
- $T_t$ is the trend-cycle component, and
- $R_t$ is the remainder component at time period $t$.

A multiplicative decomposition would be similarly written as:

$$y_t = S_t \times T_t \times R_t$$
The additive decomposition is the most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series.

When the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the time series, then a multiplicative decomposition is more appropriate. Multiplicative decompositions are common with economic time series.

  • Components are additive when the size of the seasonal swings and other variations does not change with the level of the series, and
  • Components are multiplicative when those variations scale with the level of the series.


Multiplicative

The Statsmodels library gives us the functionality to decompose time series. Below, we decompose the time series using multiplicative decomposition. Let us spend a couple of moments looking at the chart below. Note that since this is a multiplicative decomposition, the first panel, ‘Total’, is the product of the other three panels, i.e., Trend × Seasonal × Resid.

# Now we decompose our time series

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# We use the multiplicative model

result = seasonal_decompose(df_precovid['Total'], model = 'multiplicative') 
plt.rcParams['figure.figsize'] = (20, 9)
result.plot();

png

# Each of the above components are contained in our `result`
# object as trend, seasonal and error.
# Let us put them in a dataframe
ets = pd.DataFrame({'Total': df_precovid['Total'], 
              'trend': result.trend, 
              'seasonality': result.seasonal, 
              'error': result.resid}).head(20)
# ets.to_excel('ets_mul.xlsx')
ets
Total trend seasonality error
Date
2012-10-31 65695.0 NaN 1.001160 NaN
2012-11-30 50647.0 NaN 0.731912 NaN
2012-12-31 36369.0 NaN 0.536130 NaN
2013-01-31 44884.0 NaN 0.702895 NaN
2013-02-28 50027.0 NaN 0.588604 NaN
2013-03-31 66089.0 NaN 0.837828 NaN
2013-04-30 71998.0 75386.958333 0.980757 0.973785
2013-05-31 108574.0 76398.625000 1.392398 1.020650
2013-06-30 99280.0 77057.250000 1.327268 0.970710
2013-07-31 117974.0 77981.125000 1.427836 1.059543
2013-08-31 104549.0 78480.583333 1.347373 0.988712
2013-09-30 80729.0 78247.375000 1.125840 0.916396
2013-10-31 81352.0 78758.291667 1.001160 1.031736
2013-11-30 59270.0 79796.916667 0.731912 1.014822
2013-12-31 43553.0 80700.958333 0.536130 1.006629
2014-01-31 59873.0 81297.708333 0.702895 1.047762
2014-02-28 47025.0 81740.875000 0.588604 0.977386
2014-03-31 63494.0 82772.958333 0.837828 0.915565
2014-04-30 86855.0 83550.500000 0.980757 1.059948
2014-05-31 118644.0 83531.833333 1.392398 1.020071
# Check if things work in the multiplicative model

print('Total = ', 71998.0)
print('Trend * Factor for Seasonality * Factor for Error =',75386.958333 * 0.980757 * 0.973785)

Total =  71998.0
Trend * Factor for Seasonality * Factor for Error = 71998.04732763417

Additive

We do the same thing as before, except that we change the model to be additive.

result = seasonal_decompose(df_precovid['Total'], model = 'additive') 
plt.rcParams['figure.figsize'] = (20, 9)
result.plot();

png

Here, an additive model seems to make sense. This is because the residuals seem to be better centered around zero.

Obtaining the components numerically
While this is great from a visual or graphical perspective, sometimes we may need to get the actual numbers for the three decomposed components. We can do so easily - the code below provides us this data in a dataframe.

# Each of the above components are contained in our `result`
# object as trend, seasonal and error.
# Let us put them in a dataframe
ets = pd.DataFrame({'Total': df_precovid['Total'], 
              'trend': result.trend, 
              'seasonality': result.seasonal, 
              'error': result.resid}).head(20)
# ets.to_excel('ets_add.xlsx')
ets
Total trend seasonality error
Date
2012-10-31 65695.0 NaN 315.920635 NaN
2012-11-30 50647.0 NaN -22171.933532 NaN
2012-12-31 36369.0 NaN -38436.746032 NaN
2013-01-31 44884.0 NaN -24517.440476 NaN
2013-02-28 50027.0 NaN -34696.308532 NaN
2013-03-31 66089.0 NaN -13329.627976 NaN
2013-04-30 71998.0 75386.958333 -1536.224206 -1852.734127
2013-05-31 108574.0 76398.625000 32985.115079 -809.740079
2013-06-30 99280.0 77057.250000 26951.087302 -4728.337302
2013-07-31 117974.0 77981.125000 35276.733135 4716.141865
2013-08-31 104549.0 78480.583333 28635.517857 -2567.101190
2013-09-30 80729.0 78247.375000 10523.906746 -8042.281746
2013-10-31 81352.0 78758.291667 315.920635 2277.787698
2013-11-30 59270.0 79796.916667 -22171.933532 1645.016865
2013-12-31 43553.0 80700.958333 -38436.746032 1288.787698
2014-01-31 59873.0 81297.708333 -24517.440476 3092.732143
2014-02-28 47025.0 81740.875000 -34696.308532 -19.566468
2014-03-31 63494.0 82772.958333 -13329.627976 -5949.330357
2014-04-30 86855.0 83550.500000 -1536.224206 4840.724206
2014-05-31 118644.0 83531.833333 32985.115079 2127.051587
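As with the multiplicative model, we can check that the pieces fit together, this time by adding them (using the 2013-04-30 row from the table above):

# Check if things work in the additive model: Total = Trend + Seasonality + Error
print('Total = ', 71998.0)
print('Trend + Seasonality + Error =', 75386.958333 + (-1536.224206) + (-1852.734127))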
ets.describe()
Total trend seasonality error
count 20.000000 14.000000 20.000000 14.000000
mean 72844.050000 79692.997024 -5069.362252 -284.346372
std 25785.957067 2640.000168 25620.440647 3935.011207
min 36369.000000 75386.958333 -38436.746032 -8042.281746
25% 50492.000000 78047.687500 -24517.440476 -2388.509425
50% 65892.000000 79277.604167 -7432.926091 634.610615
75% 89961.250000 81630.083333 14630.701885 2240.103671
max 118644.000000 83550.500000 35276.733135 4840.724206

What is this useful for?
- Time series decomposition is primarily useful for studying time series data, and exploring historical trends over time.
- It is also useful for calculating 'seasonally adjusted' numbers, i.e., the series with the seasonal component removed (leaving the trend-cycle and remainder): seasonally adjusted number = $y_t - S_t$ for an additive decomposition, or $y_t / S_t$ for a multiplicative one.
- Note that seasons are different from cycles. Cycles have no fixed length, and we can never be sure of when they begin, peak and end. The timing of cycles is unpredictable.
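For instance (a quick sketch using the additive result object fitted above), the seasonally adjusted series is simply the data with the seasonal component subtracted out:

# Seasonally adjusted series from the additive decomposition above
seasonally_adjusted = df_precovid['Total'] - result.seasonal
seasonally_adjusted.plot(figsize=(18, 6), title='Seasonally adjusted monthly crossings');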

Moving Average and Exponentially Weighted Moving Average

Moving averages are an easy way to understand and describe time series.

By using a sliding window along which observations are averaged, they can suppress seasonality and noise, and expose the trend.

Moving averages are not generally used for forecasting, and don’t inform us about the future behavior of our time series. Their huge advantage is that they are simple to understand and explain, and they give us a high-level view of what is in the data.

Simple Moving Averages
Simple moving averages (SMA) tend to even out seasonality, and offer an easy way to examine the trend. Consider the 6 month and 12 month moving averages in the graphic below. SMAs are difficult to use for forecasting, and will lag by the window size.

Exponentially Weighted Moving Average (EWMA)
EWMA is a more advanced method than SMA, and puts more weight on values that occurred more recently. Forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older. The more recent the observation, the higher the associated weight.

This framework generates reliable forecasts quickly and for a wide range of time series, which is a great advantage and of major importance to applications in industry.

EWMA can be easily calculated using the ewm() function in pandas. The parameter adjust controls how the EWMA term is calculated.

  • When adjust=True (default), the EW function is calculated using weights $w_i = (1-\alpha)^i$. For example, the EW moving average of the series $[x_0, x_1, \ldots, x_t]$ would be:

$$y_t = \frac{x_t + (1-\alpha)\,x_{t-1} + (1-\alpha)^2 x_{t-2} + \cdots + (1-\alpha)^t x_0}{1 + (1-\alpha) + (1-\alpha)^2 + \cdots + (1-\alpha)^t}$$

  • When adjust=False, the exponentially weighted function is calculated recursively:

$$y_0 = x_0, \qquad y_t = (1-\alpha)\, y_{t-1} + \alpha\, x_t$$

(Source: Pandas documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html)

The smoothing factor alpha can be specified in one of four ways:
- alpha: specify the smoothing factor $\alpha$ directly, $0 < \alpha \le 1$.
- span: corresponds to what is commonly called an "N-day Exponentially Weighted Moving Average"; $\alpha = 2/(\mathrm{span} + 1)$, for $\mathrm{span} \ge 1$.
- com (center of mass): $\alpha = 1/(1 + \mathrm{com})$, for $\mathrm{com} \ge 0$.
- halflife: the period of time for the exponential weight to reduce to one half; $\alpha = 1 - \exp(-\ln 2 / \mathrm{halflife})$, for $\mathrm{halflife} > 0$. If times is specified, halflife is the time unit (str or timedelta) over which an observation decays to half its value; this applies only to mean(), and the halflife value will not apply to the other functions.
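As a quick sanity check of the span parameterization (a small sketch on a toy series, not part of the bicycle data), span=12 and alpha=2/(12+1) give the same result:

# Verify that span=12 is equivalent to alpha = 2/(12+1)
s = pd.Series([10.0, 12.0, 13.0, 12.0, 15.0])
by_span  = s.ewm(span=12, adjust=False).mean()
by_alpha = s.ewm(alpha=2/(12+1), adjust=False).mean()
print(np.allclose(by_span, by_alpha))  # True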

# let us look at rolling averages & EWM together

new_df = df_precovid[['Total']]
new_df['Moving_Avg_6m'] = new_df['Total'].rolling(window=6).mean()
new_df['Moving_Avg_12m'] = new_df['Total'].rolling(window=12).mean()
new_df['EWMA12'] = new_df['Total'].ewm(span=12,adjust=False).mean() 

# Note that available EW functions include mean(), var(), std(), corr(), cov().

new_df.plot();

png

# new_df.to_excel('temp.xlsx')
2/(12+1) 
0.15384615384615385

In the graph above, the red line is the EWMA with span=12, or alpha=2/(12+1) ≈ 0.15.

EWMA has a single smoothing parameter, alpha (α), and does not account for seasonality or trend. It is only suitable for data with no clear trend or seasonal pattern.

Note that we haven’t talked about forecasting yet – that comes next. We have so far only ‘fitted’ the EWMA model to a given time series. Keep in mind that EWMA is just a weighted average, with more (or less, depending on alpha) weight given to recent observations.


Stationarity

What is Stationarity?
A stationary series has a constant mean and variance over time, which means there is no trend and no seasonality either. Stationarity is important for forecasting time series because if the mean and variance are changing with the passage of time, any estimates using a regression model will start to drift very quickly as we forecast into the future.

If a time series is not stationary, we need to ‘difference’ it with itself so it becomes stationary. Differencing means you subtract the previous observation from the current observation.

How do we know if a series is stationary?
- We can examine stationarity by visually inspecting the time series.
- Or, we can run a statistical test (The Augmented Dickey-Fuller test) to check for stationarity.

Fortunately, an ARIMA model takes care of most issues with non-stationarity for us, and we do not need to adjust for it ourselves. However, if we are using ARMA (which has no differencing step), we do need to ensure that our series is stationary.

Let us look at two real time series to get a sense of stationarity. We import some stock price data, and also look at stock price returns. We just pick the S&P500 index, though we could have picked any listed company.

# Let us get some data.  We download the daily time series for the S&P500 for 30 months
import yfinance as yf
SPY = yf.download('SPY', start = '2013-01-01', end = '2015-06-30')
[*********************100%%**********************]  1 of 1 completed
# Clean up
SPY.index = pd.DatetimeIndex(SPY.index) # Set index
SPY = SPY.asfreq('B') # This creates rows for any missing dates
SPY.fillna(method = 'bfill', inplace=True) # Fills missing dates with the next available observation (backfill)
SPY.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 649 entries, 2013-01-02 to 2015-06-29
Freq: B
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       649 non-null    float64
 1   High       649 non-null    float64
 2   Low        649 non-null    float64
 3   Close      649 non-null    float64
 4   Adj Close  649 non-null    float64
 5   Volume     649 non-null    float64
dtypes: float64(6)
memory usage: 35.5 KB

Example of stationary vs non-stationary time series
The top panel shows stock returns, which appear to have a mean close to zero. The bottom shows stock prices, which appear to have a trend.

SPY['Returns'] = (SPY['Close'].shift(1) / SPY['Close']) - 1
SPY[['Returns']].plot(figsize = (22,6));

png

SPY[['Close']].plot(figsize = (22,6));

png

Making a series stationary
If data is not stationary, ‘differencing’ can make it stationary. Differencing is subtracting the prior observation from the current one.

If the differenced time series is not stationary either, we can continue differencing till we get to a stationary time series.

The number of times we have to difference a time series to get to stationarity is the ‘order’ of differencing. This is the d parameter in ARIMA.

We can difference a series using the Pandas Series.diff() method; however, there are libraries available that will automatically determine an appropriate value for the parameter d.
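As a quick sketch using the monthly bicycle totals from earlier (new_df), first-order differencing looks like this:

# First-order differencing: subtract the previous observation from each one.
# diff() does exactly this; dropna() removes the resulting leading NaN.
total_diff = new_df['Total'].diff().dropna()
total_diff.plot(figsize=(18, 6), title='First difference of monthly crossings');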

Dickey Fuller Test for Stationarity

We can run the Augmented Dickey-Fuller test for stationarity: if the p-value is > 0.05, we conclude that the series is not stationary. Let us run this test against our stock price time series.

When we run this test in Python, we get a cryptic output in the form of a tuple. The help text for this function shows the complete explanation for how to interpret the results:

Returns
-------
adf : float
    The test statistic.
pvalue : float
    MacKinnon's approximate p-value based on MacKinnon (1994, 2010).
usedlag : int
    The number of lags used.
nobs : int
    The number of observations used for the ADF regression and calculation
    of the critical values.
critical values : dict
    Critical values for the test statistic at the 1 %, 5 %, and 10 %
    levels. Based on MacKinnon (2010).
icbest : float
    The maximized information criterion if autolag is not None.
resstore : ResultStore, optional
    A dummy class with results attached as attributes.

For us, the second value is the p-value that we are interested in. If this is > 0.05, we decide the series is not stationary.

# Test the stock price data

from statsmodels.tsa.stattools import adfuller
adfuller(SPY['Close'])
(-1.6928673813673563,
 0.4347911128784576,
 0,
 648,
 {'1%': -3.4404817800778034,
  '5%': -2.866010569916275,
  '10%': -2.569150763698369},
 2126.1002309138994)
# Test the stock returns data  

adfuller(SPY['Returns'].dropna())
(-26.546757517762995,
 0.0,
 0,
 647,
 {'1%': -3.4404975024933813,
  '5%': -2.8660174956716795,
  '10%': -2.569154453750397},
 -4424.286299515888)
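To avoid reading the raw tuple each time, a small wrapper (our own helper, not part of statsmodels) can label the pieces we care about:

# A small convenience wrapper around adfuller() that labels the test
# statistic and p-value and states the conclusion
def adf_report(series, alpha=0.05):
    stat, pvalue, usedlag, nobs, crit, icbest = adfuller(series.dropna())
    verdict = 'stationary' if pvalue < alpha else 'not stationary'
    print(f'ADF statistic = {stat:.3f}, p-value = {pvalue:.4f} -> {verdict}')

adf_report(SPY['Close'])     # prices: not stationary
adf_report(SPY['Returns'])   # returns: stationary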

Auto-Correlation and Partial Auto-Correlation (ACF and PACF plots)

Autocorrelation in a time series is the correlation of an observation to the observations that precede it. Autocorrelation is the basis for being able to use auto regression to forecast a time series.

To calculate autocorrelation for a series, we shift the series by one step, and calculate the correlation between the two. We keep increasing the number of steps to see correlations with past periods.
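For example (a small sketch using the monthly totals from earlier), the lag-1 autocorrelation is just the correlation between the series and a one-step-shifted copy of itself:

# Lag-1 autocorrelation computed by hand: correlate the series with a
# shifted copy of itself. statsmodels' acf() uses a slightly different
# normalization, so its value will be close but not identical.
lag1 = new_df['Total'].corr(new_df['Total'].shift(1))
print(lag1)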

Fortunately, libraries exist that allow us to do these tedious calculations and present a tidy graph.

Next, we will look at Autocorrelation plots for both our stock price series, and also the total number of bicycle crossings in Seattle.

# Autocorrelation and partial autocorrelation plots for stock prices  

plt.rc("figure", figsize=(18,4))
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
plot_acf(SPY['Close']);
plot_pacf(SPY['Close']);

png

png

The shaded area represents the 95% confidence interval.

Using ACF and PACF plots for ARIMA:
- The PACF plot can be used to identify the value of p, the AR order.
- The ACF plot can be used to identify the value of q, the MA order.
- The interpretation of ACF and PACF plots to determine values of p & q for ARIMA can be complex.

Below, we see the ACF and PACF plots for the bicycle crossings. Their seasonality is quite visible.

# Autocorrelation and partial autocorrelation plots for monthly bicycle crossings

plot_acf(new_df.Total);
plot_pacf(new_df.Total);

png

png

# Get raw values for auto-correlations

from statsmodels.tsa.stattools import acf
acf(new_df.Total)
array([ 1.        ,  0.77571582,  0.52756191,  0.09556274, -0.30509232,
       -0.60545323, -0.73226582, -0.64087121, -0.37044611, -0.01811217,
        0.35763185,  0.59437737,  0.7515538 ,  0.61992343,  0.40344956,
        0.06119329, -0.28745883, -0.53888819, -0.65309667, -0.57442506])
# Slightly nicer output making it easy to read lag and correlation
[(n,x ) for n, x in enumerate(acf(new_df.Total))]
[(0, 1.0),
 (1, 0.7757158156417427),
 (2, 0.5275619080263295),
 (3, 0.09556274009387859),
 (4, -0.30509232288765453),
 (5, -0.6054532313442101),
 (6, -0.732265815426328),
 (7, -0.6408712113652556),
 (8, -0.3704461111442763),
 (9, -0.018112170681472205),
 (10, 0.35763184544832965),
 (11, 0.5943773727598759),
 (12, 0.7515538007556243),
 (13, 0.6199234263452077),
 (14, 0.4034495609681654),
 (15, 0.061193291548871764),
 (16, -0.28745882651811744),
 (17, -0.5388881889155354),
 (18, -0.6530966725971313),
 (19, -0.5744250570228712)]


Granger Causality Tests

The Granger Causality tests are used to check whether two time series are related to each other; specifically, given two time series, whether the series in the second column can be used to predict the series in the first column.

The ‘maxlag’ parameter needs to be specified, and the code will report the p-values at each lag up to the maxlag value. If the p-value < 0.05 at any lag, the second series may be a valid predictor (or causal factor) at that lag.


Example
This test is quite easy for us to run using statsmodels. As an example, we apply it to two separate time series, one showing the average daily temperature, and the other showing the average daily household power consumption. This data was adapted from a Kaggle dataset to create this illustration.

from statsmodels.tsa.stattools import grangercausalitytests
# Data adapted from: 
# https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries

df_elec = pd.read_csv('pwr_usage.csv')
df_elec.index = pd.DatetimeIndex(df_elec.Date, freq='W-SUN')
df_elec.drop(['Date'], axis = 1, inplace = True)
df_elec
Temp_avg kwh
Date
2017-01-08 75.542857 106.549
2017-01-15 71.014286 129.096
2017-01-22 64.414286 68.770
2017-01-29 56.728571 71.378
2017-02-05 66.128571 107.829
... ... ...
2019-12-08 74.371429 167.481
2019-12-15 61.242857 86.248
2019-12-22 50.000000 73.206
2019-12-29 60.128571 35.655
2020-01-05 7.185714 4.947

157 rows × 2 columns

# Check if Average Temperature can be used to predict kwh

grangercausalitytests(df_elec[["kwh", "Temp_avg"]], maxlag = 6);
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.1022  , p=0.7496  , df_denom=153, df_num=1
ssr based chi2 test:   chi2=0.1042  , p=0.7468  , df=1
likelihood ratio test: chi2=0.1042  , p=0.7469  , df=1
parameter F test:         F=0.1022  , p=0.7496  , df_denom=153, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=0.9543  , p=0.3874  , df_denom=150, df_num=2
ssr based chi2 test:   chi2=1.9722  , p=0.3730  , df=2
likelihood ratio test: chi2=1.9597  , p=0.3754  , df=2
parameter F test:         F=0.9543  , p=0.3874  , df_denom=150, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=1.6013  , p=0.1916  , df_denom=147, df_num=3
ssr based chi2 test:   chi2=5.0327  , p=0.1694  , df=3
likelihood ratio test: chi2=4.9523  , p=0.1753  , df=3
parameter F test:         F=1.6013  , p=0.1916  , df_denom=147, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=1.5901  , p=0.1801  , df_denom=144, df_num=4
ssr based chi2 test:   chi2=6.7579  , p=0.1492  , df=4
likelihood ratio test: chi2=6.6129  , p=0.1578  , df=4
parameter F test:         F=1.5901  , p=0.1801  , df_denom=144, df_num=4

Granger Causality
number of lags (no zero) 5
ssr based F test:         F=1.2414  , p=0.2930  , df_denom=141, df_num=5
ssr based chi2 test:   chi2=6.6913  , p=0.2446  , df=5
likelihood ratio test: chi2=6.5482  , p=0.2565  , df=5
parameter F test:         F=1.2414  , p=0.2930  , df_denom=141, df_num=5

Granger Causality
number of lags (no zero) 6
ssr based F test:         F=1.0930  , p=0.3697  , df_denom=138, df_num=6
ssr based chi2 test:   chi2=7.1759  , p=0.3049  , df=6
likelihood ratio test: chi2=7.0106  , p=0.3199  , df=6
parameter F test:         F=1.0930  , p=0.3697  , df_denom=138, df_num=6
# Check if kwh can be used to predict Average Temperature 
# While we get p<0.05 at lag 3, the result is obviously absurd
grangercausalitytests(df_elec[["Temp_avg", "kwh"]], maxlag = 6);
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=3.1953  , p=0.0758  , df_denom=153, df_num=1
ssr based chi2 test:   chi2=3.2580  , p=0.0711  , df=1
likelihood ratio test: chi2=3.2244  , p=0.0725  , df=1
parameter F test:         F=3.1953  , p=0.0758  , df_denom=153, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=2.8694  , p=0.0599  , df_denom=150, df_num=2
ssr based chi2 test:   chi2=5.9301  , p=0.0516  , df=2
likelihood ratio test: chi2=5.8194  , p=0.0545  , df=2
parameter F test:         F=2.8694  , p=0.0599  , df_denom=150, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=3.0044  , p=0.0324  , df_denom=147, df_num=3
ssr based chi2 test:   chi2=9.4423  , p=0.0240  , df=3
likelihood ratio test: chi2=9.1642  , p=0.0272  , df=3
parameter F test:         F=3.0044  , p=0.0324  , df_denom=147, df_num=3

Granger Causality
number of lags (no zero) 4
ssr based F test:         F=1.3019  , p=0.2722  , df_denom=144, df_num=4
ssr based chi2 test:   chi2=5.5329  , p=0.2369  , df=4
likelihood ratio test: chi2=5.4352  , p=0.2455  , df=4
parameter F test:         F=1.3019  , p=0.2722  , df_denom=144, df_num=4

Granger Causality
number of lags (no zero) 5
ssr based F test:         F=0.8068  , p=0.5466  , df_denom=141, df_num=5
ssr based chi2 test:   chi2=4.3488  , p=0.5004  , df=5
likelihood ratio test: chi2=4.2877  , p=0.5088  , df=5
parameter F test:         F=0.8068  , p=0.5466  , df_denom=141, df_num=5

Granger Causality
number of lags (no zero) 6
ssr based F test:         F=0.5381  , p=0.7785  , df_denom=138, df_num=6
ssr based chi2 test:   chi2=3.5328  , p=0.7396  , df=6
likelihood ratio test: chi2=3.4921  , p=0.7450  , df=6
parameter F test:         F=0.5381  , p=0.7785  , df_denom=138, df_num=6

In short, we don't find any causality above, though our intuition would have told us that something should exist. Perhaps there are variables other than temperature that impact power consumption that we have not thought of.

That is the power of data - commonly held conceptions can be challenged.


Forecasting with Simple Exponential Smoothing, Holt and Holt-Winters Method

(a) Simple Exponential Smoothing
In simple exponential smoothing, we reduce the time series to a single smoothed ‘level’, which serves as the forecast.

Simple exponential smoothing is suitable for forecasting data that has no clear trend or seasonal component. Obviously, this sort of data will have no pattern, so how do we forecast it? Consider two extreme approaches:
- Every future value will be equal to the average of all prior values,
- Every future value is the same as the last one.

The difference between the two extreme situations above is that in the first one, we weigh all past observations as equally important, and in the second, we give all the weight to the last observation and none to the ones prior to that.

The Simple Exponential Smoothing method takes an approach in between - it gives the most weight to the last observation, and gradually reduces the weight as we go further back in the past. It does so using a single parameter called alpha.

(Source: https://otexts.com/fpp2/ses.html)
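In equation form, the one-step-ahead forecast is a weighted average of all past observations, with weights that decay at the rate $(1-\alpha)$:

$$\hat{y}_{t+1|t} = \alpha y_t + \alpha(1-\alpha)\, y_{t-1} + \alpha(1-\alpha)^2\, y_{t-2} + \cdots, \qquad 0 \le \alpha \le 1$$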

(b) Holt's Method - Double Exponential Smoothing
Holt extended the simple exponential smoothing described above to account for a trend. This is captured in a parameter called beta (β).

This method involves calculating a ‘level’, with the smoothing parameter α, as well as the trend using a smoothing parameter β. These parameters are used in a way similar to what we saw with EWMA.

Because we are using two parameters, it is called ‘double exponential smoothing’.

We can specify additive or multiplicative trends. Additive trends are preferred when the level of change over time is constant. Multiplicative trends make sense when the trend varies proportional to the current values of the series.
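For reference, Holt's method with an additive trend maintains a level $\ell_t$ and a trend $b_t$, and forecasts by projecting the trend forward:

$$
\begin{aligned}
\ell_t &= \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta (\ell_t - \ell_{t-1}) + (1-\beta)\, b_{t-1} \\
\hat{y}_{t+h|t} &= \ell_t + h\, b_t
\end{aligned}
$$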

(c) Holt-Winters' Method - Triple Exponential Smoothing
The Holt-Winters’ seasonal method accounts for the level, as well as the trend and seasonality, with corresponding smoothing parameters α, β and γ respectively:
- Level: α
- Trend: β
- Seasonality: γ

We also specify the frequency of the seasonality, i.e., the number of periods that comprise a season. For example, for quarterly data the frequency would be 4, and for monthly data, it would be 12.

Like for trend, we can specify whether the seasonality is additive or multiplicative.

The equations for Triple Exponential Smoothing (in the additive form) look as follows, where $\ell_t$ is the level, $b_t$ the trend, $s_t$ the seasonal component, $m$ the seasonal frequency, and $k$ the integer part of $(h-1)/m$:

$$
\begin{aligned}
\ell_t &= \alpha (y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta (\ell_t - \ell_{t-1}) + (1-\beta)\, b_{t-1} \\
s_t &= \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\, s_{t-m} \\
\hat{y}_{t+h|t} &= \ell_t + h\, b_t + s_{t+h-m(k+1)}
\end{aligned}
$$

We will not cover these equations in detail as the code does everything for us. As practitioners, we need to think about the problems we can solve with this, while being aware of the underlying logic. The code will estimate the values of α, β and γ and use them in the equations above to make predictions.

# Let us look at the index of our data frame that has the bicycle crossing data

new_df.index
DatetimeIndex(['2012-10-31', '2012-11-30', '2012-12-31', '2013-01-31',
               '2013-02-28', '2013-03-31', '2013-04-30', '2013-05-31',
               '2013-06-30', '2013-07-31', '2013-08-31', '2013-09-30',
               '2013-10-31', '2013-11-30', '2013-12-31', '2014-01-31',
               '2014-02-28', '2014-03-31', '2014-04-30', '2014-05-31',
               '2014-06-30', '2014-07-31', '2014-08-31', '2014-09-30',
               '2014-10-31', '2014-11-30', '2014-12-31', '2015-01-31',
               '2015-02-28', '2015-03-31', '2015-04-30', '2015-05-31',
               '2015-06-30', '2015-07-31', '2015-08-31', '2015-09-30',
               '2015-10-31', '2015-11-30', '2015-12-31', '2016-01-31',
               '2016-02-29', '2016-03-31', '2016-04-30', '2016-05-31',
               '2016-06-30', '2016-07-31', '2016-08-31', '2016-09-30',
               '2016-10-31', '2016-11-30', '2016-12-31', '2017-01-31',
               '2017-02-28', '2017-03-31', '2017-04-30', '2017-05-31',
               '2017-06-30', '2017-07-31', '2017-08-31', '2017-09-30',
               '2017-10-31', '2017-11-30', '2017-12-31', '2018-01-31',
               '2018-02-28', '2018-03-31', '2018-04-30', '2018-05-31',
               '2018-06-30', '2018-07-31', '2018-08-31', '2018-09-30',
               '2018-10-31', '2018-11-30', '2018-12-31', '2019-01-31',
               '2019-02-28', '2019-03-31', '2019-04-30', '2019-05-31',
               '2019-06-30', '2019-07-31', '2019-08-31', '2019-09-30',
               '2019-10-31', '2019-11-30'],
              dtype='datetime64[ns]', name='Date', freq='M')
# Clean up the data frame, set index frequency explicitly to Monthly
# This is needed as Holt-Winters will not work otherwise

new_df.index.freq = 'M'
# Let us drop any NaN entries, just in case

new_df.dropna(inplace=True)
new_df
Total Moving_Avg_6m Moving_Avg_12m EWMA12
Date
2013-09-30 80729.0 97184.000000 74734.583333 82750.448591
2013-10-31 81352.0 98743.000000 76039.333333 82535.302654
2013-11-30 59270.0 90525.666667 76757.916667 78956.025322
2013-12-31 43553.0 81237.833333 77356.583333 73509.406042
2014-01-31 59873.0 71554.333333 78605.666667 71411.497420
... ... ... ... ...
2019-07-31 137714.0 101472.833333 91343.750000 100389.020732
2019-08-31 142414.0 119192.000000 93894.166667 106854.402158
2019-09-30 112174.0 123644.833333 95221.833333 107672.801826
2019-10-31 104498.0 126405.833333 96348.166667 107184.370776
2019-11-30 84963.0 119045.833333 97725.833333 103765.698349

75 rows × 4 columns

# Set warnings to ignore so we don't get the ugly orange boxes

import warnings
warnings.filterwarnings('ignore')
# Some library imports 

from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error as mse

Simple Exponential Smoothing (same as EWMA)

# Train-test split
train_samples = int(new_df.shape[0] * 0.8)

train_set = new_df.iloc[:train_samples]
test_set = new_df.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  60
Test set:  15
# Fit model using Simple Exponential Smoothing
model = SimpleExpSmoothing(train_set['Total']).fit()
predictions = model.forecast(15)
# let us plot the predictions and the training values
train_set['Total'].plot(legend=True,label='Training Set')
predictions.plot(legend=True,label='Model prediction');

png

# Now we plot test (observed) values as well
train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(16,10))
predictions.plot(legend=True,label='Model prediction');

png

# Test vs observed - a closeup of the predictions
# train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(14,10))
predictions.plot(legend=True,label='Model prediction');

png

model.params
{'smoothing_level': 0.995,
 'smoothing_trend': nan,
 'smoothing_seasonal': nan,
 'damping_trend': nan,
 'initial_level': 80729.0,
 'initial_trend': nan,
 'initial_seasons': array([], dtype=float64),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}
# Calculate Evaluation Metrics
y_test = test_set['Total']
y_pred = predictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2018-09-30 96242.0 111889.675227 -15647.675227
2018-10-31 90982.0 111889.675227 -20907.675227
2018-11-30 68431.0 111889.675227 -43458.675227
2018-12-31 46941.0 111889.675227 -64948.675227
2019-01-31 72883.0 111889.675227 -39006.675227
2019-02-28 36099.0 111889.675227 -75790.675227
2019-03-31 85457.0 111889.675227 -26432.675227
2019-04-30 87932.0 111889.675227 -23957.675227
2019-05-31 129123.0 111889.675227 17233.324773
2019-06-30 132512.0 111889.675227 20622.324773
2019-07-31 137714.0 111889.675227 25824.324773
2019-08-31 142414.0 111889.675227 30524.324773
2019-09-30 112174.0 111889.675227 284.324773
2019-10-31 104498.0 111889.675227 -7391.675227
2019-11-30 84963.0 111889.675227 -26926.675227
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  1228535508.79919
RMSE =  35050.47087842316
MAE =  29263.82507577501

Double Exponential Smoothing

# Train-test split
train_samples = int(new_df.shape[0] * 0.8)

train_set = new_df.iloc[:train_samples]
test_set = new_df.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  60
Test set:  15
# Fit model using Double Exponential Smoothing (Holt's method)
# For comparison, Simple Exponential Smoothing was:
# model = SimpleExpSmoothing(train_set['Total']).fit()
model = ExponentialSmoothing(train_set['Total'], trend='mul').fit()
predictions = model.forecast(15)
# let us plot the predictions and the training values
train_set['Total'].plot(legend=True,label='Training Set')
predictions.plot(legend=True,label='Model prediction');

png

# Now we plot test (observed) values as well
train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(16,10))
predictions.plot(legend=True,label='Model prediction');

png

# Test vs observed - a closeup of the predictions
# train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(14,10))
predictions.plot(legend=True,label='Model prediction');

png

model.params
{'smoothing_level': 0.995,
 'smoothing_trend': 0.04738095238095238,
 'smoothing_seasonal': nan,
 'damping_trend': nan,
 'initial_level': 51251.999999999985,
 'initial_trend': 1.084850613368525,
 'initial_seasons': array([], dtype=float64),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}
# Calculate Evaluation Metrics
y_test = test_set['Total']
y_pred = predictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2018-09-30 96242.0 117626.951449 -21384.951449
2018-10-31 90982.0 123616.042220 -32634.042220
2018-11-30 68431.0 129910.073380 -61479.073380
2018-12-31 46941.0 136524.571266 -89583.571266
2019-01-31 72883.0 143475.852752 -70592.852752
2019-02-28 36099.0 150781.065503 -114682.065503
2019-03-31 85457.0 158458.230275 -73001.230275
2019-04-30 87932.0 166526.285367 -78594.285367
2019-05-31 129123.0 175005.133341 -45882.133341
2019-06-30 132512.0 183915.690115 -51403.690115
2019-07-31 137714.0 193279.936564 -55565.936564
2019-08-31 142414.0 203120.972739 -60706.972739
2019-09-30 112174.0 213463.074854 -101289.074854
2019-10-31 104498.0 224331.755168 -119833.755168
2019-11-30 84963.0 235753.824924 -150790.824924
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  6789777046.197141
RMSE =  82400.10343559734
MAE =  75161.63066121914

Triple Exponential Smoothing

# Train-test split
train_samples = int(new_df.shape[0] * 0.8)

train_set = new_df.iloc[:train_samples]
test_set = new_df.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  60
Test set:  15
# Let us use the triple exponential smoothing model
model = ExponentialSmoothing(train_set['Total'],trend='add', \
                             seasonal='add',seasonal_periods=12).fit()
predictions = model.forecast(15)
# let us plot the predictions and the training values
train_set['Total'].plot(legend=True,label='Training Set')
predictions.plot(legend=True,label='Model prediction');

png

# Now we plot test (observed) values as well
train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(16,10))
predictions.plot(legend=True,label='Model prediction');

png

# Test vs observed - closeup of the predictions
# train_set['Total'].plot(legend=True,label='Training Set')
test_set['Total'].plot(legend=True,label='Test Set',figsize=(14,10))
predictions.plot(legend=True,label='Model prediction');

png

model.params
{'smoothing_level': 0.2525,
 'smoothing_trend': 0.0001,
 'smoothing_seasonal': 0.0001,
 'damping_trend': nan,
 'initial_level': 82880.52777777775,
 'initial_trend': 233.32297979798386,
 'initial_seasons': array([ 12706.58333333,  -1150.10416667, -23387.98958333, -38062.89583333,
        -27297.51041667, -29627.21875   , -15820.59375   ,   1298.15625   ,
         30509.6875    ,  28095.90625   ,  32583.70833333,  30152.27083333]),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}
# Calculate Evaluation Metrics
y_test = test_set['Total']
y_pred = predictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2018-09-30 96242.0 100914.043265 -4672.043265
2018-10-31 90982.0 87291.600234 3690.399766
2018-11-30 68431.0 65286.060329 3144.939671
2018-12-31 46941.0 50843.440823 -3902.440823
2019-01-31 72883.0 61841.794519 11041.205481
2019-02-28 36099.0 59743.301291 -23644.301291
2019-03-31 85457.0 73783.743291 11673.256709
2019-04-30 87932.0 91133.340351 -3201.340351
2019-05-31 129123.0 120579.559129 8543.440871
2019-06-30 132512.0 118396.387406 14115.612594
2019-07-31 137714.0 123117.725595 14596.274405
2019-08-31 142414.0 120917.981585 21496.018415
2019-09-30 112174.0 103703.284649 8470.715351
2019-10-31 104498.0 90080.841618 14417.158382
2019-11-30 84963.0 68075.301713 16887.698287
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  160014279.71049073
RMSE =  12649.675083198412
MAE =  10899.78971069559
# Finally, fit Holt's method (double exponential smoothing) on the full dataset
model = ExponentialSmoothing(new_df['Total'], trend='mul').fit()
# Fit values
pd.DataFrame({'fitted':model.fittedvalues.shift(-1), 'actual':new_df['Total']})
fitted actual
Date
2013-09-30 88374.142596 80729.0
2013-10-31 89066.318586 81352.0
2013-11-30 64512.628727 59270.0
2013-12-31 47037.321040 43553.0
2014-01-31 64853.079989 59873.0
... ... ...
2019-07-31 147204.031288 137714.0
2019-08-31 152114.454701 142414.0
2019-09-30 119265.035267 112174.0
2019-10-31 110660.795745 104498.0
2019-11-30 NaN 84963.0

75 rows × 2 columns

# Examine model parameters
model.params
{'smoothing_level': 0.995,
 'smoothing_trend': 0.02369047619047619,
 'smoothing_seasonal': nan,
 'damping_trend': nan,
 'initial_level': 51251.999999999985,
 'initial_trend': 1.084850613368525,
 'initial_seasons': array([], dtype=float64),
 'use_boxcox': False,
 'lamda': None,
 'remove_bias': False}
# RMSE calculation
x = pd.DataFrame({'fitted':model.fittedvalues.shift(-1), 'actual':new_df['Total']}).dropna()
rmse = np.sqrt(mse(x.fitted, x.actual))
rmse
6017.795365317929
# predict the next 15 values
predictions = model.forecast(15)
predictions
2019-12-31     89553.250900
2020-01-31     94248.964768
2020-02-29     99190.897823
2020-03-31    104391.960540
2020-04-30    109865.740351
2020-05-31    115626.537143
2020-06-30    121689.400618
2020-07-31    128070.169604
2020-08-31    134785.513440
2020-09-30    141852.975517
2020-10-31    149291.019112
2020-11-30    157119.075623
2020-12-31    165357.595328
2021-01-31    174028.100817
2021-02-28    183153.243211
Freq: M, dtype: float64
model.fittedvalues
Date
2013-09-30     55600.763636
2013-10-31     88374.142596
2013-11-30     89066.318586
2013-12-31     64512.628727
2014-01-31     47037.321040
                  ...      
2019-07-31    141747.388757
2019-08-31    147204.031288
2019-09-30    152114.454701
2019-10-31    119265.035267
2019-11-30    110660.795745
Freq: M, Length: 75, dtype: float64

ARIMA - Auto Regressive Integrated Moving Average

ARIMA stands for Auto Regressive Integrated Moving Average. It is a general method for understanding and predicting time series data.

ARIMA models come in several different flavors, for example:
- Non-seasonal ARIMA
- Seasonal ARIMA (called SARIMA)
- SARIMA with external variables, called SARIMAX

Which one should we use? ARIMA or Holt-Winters?

Try both. Whatever works better for your use case is the one to use.

ARIMA models have three non-negative integer parameters – p, d and q.
- p represents the Auto-Regression component, AR. This is the part of the model that leverages the linear regression between an observation and past observations.
- d represents differencing, the I component. This is the number of times the series has to be differenced to make it stationary.
- q represents the MA component, the number of lagged forecast errors in the prediction. This considers the relationship between an observation and the residual error from a moving average model.

A correct choice of the ‘order’ of your ARIMA model, i.e., deciding the values of p, d and q, is essential to building a good ARIMA model.

Deciding the values of p, d and q
- Values of p and q can be determined manually by examining auto-correlation and partial-autocorrelation plots.
- The value of d can be determined by repeatedly differencing a series till we get to a stationary series.

The manual methods are time consuming, and less precise. An overview of these is provided in the Appendix to this slide deck.
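As a rough sketch of the automated approach to d (our own helper built on the ADF test from earlier, not part of any library), we can difference repeatedly until the series tests as stationary:

# A rough helper: difference until the ADF test rejects non-stationarity,
# and report how many differences (d) it took
from statsmodels.tsa.stattools import adfuller

def estimate_d(series, max_d=3, alpha=0.05):
    s = series.dropna()
    for d in range(max_d + 1):
        if adfuller(s)[1] < alpha:
            return d
        s = s.diff().dropna()
    return max_d

print('Suggested d for the monthly crossings:', estimate_d(new_df['Total']))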

In reality, we let the computer do a grid search (a brute force test of a set of permutations for p, d and q) to determine the order of our ARIMA model. The pmdarima library (formerly Pyramid ARIMA) in Python allows searching through multiple combinations of p, d and q to identify the best model.

ARIMA in Action
1. Split dataset into train and test.
2. Test set should be the last n entries. Can’t use random selection for train-test split.
3. Pyramid ARIMA is a Python package that can identify the values of p, d and q to use. Use Auto ARIMA from Pyramid ARIMA to find good values of p, d and q based on the training data set.
4. Fit a model on the training data set.
5. Predict the test set, and evaluate using MSE, MAE or RMSE.

# Library imports

from pmdarima import auto_arima
# Train-test split

train_samples = int(new_df.shape[0] * 0.8)

train_set = new_df.iloc[:train_samples]
test_set = new_df.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  60
Test set:  15
# Clean up

train_set.dropna(inplace=True)

Use Auto ARIMA to find the order

# Build a model using auto_arima

model = auto_arima(train_set['Total'],seasonal=False)
order = model.get_params()['order']
seasonal_order = model.get_params()['seasonal_order']
print('Order = ', order)
print('Seasonal Order = ', seasonal_order)
Order =  (5, 0, 1)
Seasonal Order =  (0, 0, 0, 0)
# Create and fit model

from statsmodels.tsa.arima.model import ARIMA

model_ARIMA = ARIMA(train_set['Total'], order = order)
model_ARIMA = model_ARIMA.fit()
# Predict with ARIMA 

start=len(train_set)
end=len(train_set)+len(test_set)-1
ARIMApredictions = model_ARIMA.predict(start=start, end=end, dynamic=False, typ='levels').rename('ARIMA Predictions')
# model_ARIMA.summary()
# Calculate Evaluation Metrics
y_test = test_set['Total']
y_pred = ARIMApredictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2018-09-30 96242.0 97178.782841 -936.782841
2018-10-31 90982.0 68163.166776 22818.833224
2018-11-30 68431.0 56095.942557 12335.057443
2018-12-31 46941.0 41005.329654 5935.670346
2019-01-31 72883.0 42591.251732 30291.748268
2019-02-28 36099.0 51555.855575 -15456.855575
2019-03-31 85457.0 71734.836673 13722.163327
2019-04-30 87932.0 91198.670052 -3266.670052
2019-05-31 129123.0 110818.433668 18304.566332
2019-06-30 132512.0 121161.132303 11350.867697
2019-07-31 137714.0 122198.078813 15515.921187
2019-08-31 142414.0 111847.671126 30566.328874
2019-09-30 112174.0 94763.626480 17410.373520
2019-10-31 104498.0 73934.456312 30563.543688
2019-11-30 84963.0 56060.536900 28902.463100
# Calculate evaluation metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  385065542.1818603
RMSE =  19623.086968717747
MAE =  17158.523031611756
# Plot results
train_set['Total'].rename('Training Set').plot(legend=True)
test_set['Total'].rename('Test Set').plot(legend=True)
ARIMApredictions.plot(legend=True)
plt.show()

png


Seasonal ARIMA - SARIMA

SARIMA, or Seasonal ARIMA, accounts for seasonality. In order to account for seasonality, we need three more parameters – P, D and Q – to take care of seasonal variations.

Auto ARIMA takes care of seasonality as well, and provides us the values for P, D and Q just like it does for p, d and q.

To use SARIMA, we now need 7 parameters for our function:
- p
- d
- q
- P
- D
- Q
- m (frequency of our seasons, eg, 12)

SARIMA in Action
1. Split dataset into train and test.
2. Test set should be the last n entries. Can’t use random selection for train-test split.
3. Pyramid ARIMA is a Python package that can identify the values of p, d, q, P, D and Q to use. Use Auto ARIMA from Pyramid ARIMA to find good values of p, d and q based on the training data set.
4. Fit a model on the training data set.
5. Predict the test set, and evaluate using MSE, MAE or RMSE.

# Create a model with auto_arima

model = auto_arima(train_set['Total'],seasonal=True,m=12)
# Get values of p, d, q, P, D and Q

order = model.get_params()['order']
seasonal_order = model.get_params()['seasonal_order']

print('Order = ', order)
print('Seasonal Order = ', seasonal_order)
Order =  (1, 0, 0)
Seasonal Order =  (0, 1, 1, 12)
# Create and fit model

from statsmodels.tsa.statespace.sarimax import SARIMAX
model_SARIMA = SARIMAX(train_set['Total'], order=order, seasonal_order=seasonal_order)
model_SARIMA = model_SARIMA.fit()
model_SARIMA.params
ar.L1       2.438118e-01
ma.S.L12   -6.668071e-02
sigma2      7.885998e+07
dtype: float64
# Create SARIMA predictions  

start=len(train_set)
end=len(train_set)+len(test_set)-1
SARIMApredictions = model_SARIMA.predict(start=start, end=end, dynamic=False, typ='levels').rename('SARIMA Predictions')
# Calculate Evaluation Metrics

y_test = test_set['Total']
y_pred = SARIMApredictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2018-09-30 96242.0 94423.659176 1818.340824
2018-10-31 90982.0 86518.880961 4463.119039
2018-11-30 68431.0 57965.336973 10465.663027
2018-12-31 46941.0 45396.267772 1544.732228
2019-01-31 72883.0 58009.524643 14873.475357
2019-02-28 36099.0 50177.757219 -14078.757219
2019-03-31 85457.0 76096.865509 9360.134491
2019-04-30 87932.0 79286.971453 8645.028547
2019-05-31 129123.0 128451.795674 671.204326
2019-06-30 132512.0 112789.442695 19722.557305
2019-07-31 137714.0 127353.588372 10360.411628
2019-08-31 142414.0 112330.356235 30083.643765
2019-09-30 112174.0 94550.771990 17623.228010
2019-10-31 104498.0 86549.872568 17948.127432
2019-11-30 84963.0 57972.893093 26990.106907
# Metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  231993016.88445473
RMSE =  15231.316978004716
MAE =  12576.56867360885
# Plot results

train_set['Total'].rename('Training Set').plot(legend=True)
test_set['Total'].rename('Test Set').plot(legend=True)
SARIMApredictions.plot(legend = True)
ARIMApredictions.plot(legend=True)
plt.show()

png

# Models compared to each other - calculate MAE, RMSE

from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae
print('SARIMA:')
print('  RMSE = ' ,mse(SARIMApredictions, test_set['Total'], squared = False))
print('  MAE = ', mae(SARIMApredictions, test_set['Total']))
print('\nARIMA:')
print('  RMSE = ' ,mse(ARIMApredictions, test_set['Total'], squared = False))
print('  MAE = ', mae(ARIMApredictions, test_set['Total']))
print('\n')
print('  Mean of the data = ', new_df.Total.mean())
print('  St Dev of the data = ', new_df.Total.std())
SARIMA:
  RMSE =  15231.316978004716
  MAE =  12576.56867360885

ARIMA:
  RMSE =  19623.086968717747
  MAE =  17158.523031611756


  Mean of the data =  85078.8
  St Dev of the data =  28012.270090473237

SARIMAX

SARIMAX = Seasonal ARIMA with eXogenous variable

SARIMAX is the same as SARIMA, except that it includes one or more predictor (exogenous) variables alongside the time series itself.

Let us load some data showing weekly power consumption along with the average temperature for each week. We will model kwh as a time series and use Temp_avg as an exogenous variable.

To forecast the future, you need not only the past values of the series but also values of the exogenous variable over the forecast horizon. This can be a problem in practice: if the exogenous variable is not known (or cannot itself be forecast) in advance, you may build a model that evaluates well on a test set yet cannot actually be used for forecasting.
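
To make this concrete, here is a minimal sketch on synthetic data (not the dataset we load below): forecasting beyond the observed period requires you to supply future values of the exogenous variable yourself, for example from a weather forecast.

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Two years of synthetic weekly data where temperature drives consumption
idx = pd.date_range('2017-01-08', periods=104, freq='W-SUN')
temp = pd.Series(70 + 10 * np.sin(np.arange(104) * 2 * np.pi / 52), index=idx)
kwh = pd.Series(2.0 * temp + np.random.normal(0, 5, size=104), index=idx)

fit = SARIMAX(kwh, exog=temp, order=(1, 0, 0)).fit(disp=False)

# To forecast 4 future weeks, we must assume (or separately forecast) the temperatures
future_temp = pd.Series([72.0, 68.5, 65.0, 61.0],
                        index=pd.date_range(idx[-1] + pd.Timedelta(weeks=1),
                                            periods=4, freq='W-SUN'))
print(fit.forecast(steps=4, exog=future_temp))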

# Data adapted from: 
# https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries  
# The data shows weekly electricity usage and the average temperature of the week.  
# Our hypothesis is that the power consumed can be predicted using the average 
# temperature, and the pattern found in the time series.

df_elec = pd.read_csv('pwr_usage.csv')
df_elec
Date Temp_avg kwh
0 1/8/2017 75.542857 106.549
1 1/15/2017 71.014286 129.096
2 1/22/2017 64.414286 68.770
3 1/29/2017 56.728571 71.378
4 2/5/2017 66.128571 107.829
... ... ... ...
152 12/8/2019 74.371429 167.481
153 12/15/2019 61.242857 86.248
154 12/22/2019 50.000000 73.206
155 12/29/2019 60.128571 35.655
156 1/5/2020 7.185714 4.947

157 rows × 3 columns

Data exploration

df_elec.index = pd.DatetimeIndex(df_elec.Date, freq='W-SUN')
df_elec.drop(['Date'], axis = 1, inplace = True)
df_elec
Temp_avg kwh
Date
2017-01-08 75.542857 106.549
2017-01-15 71.014286 129.096
2017-01-22 64.414286 68.770
2017-01-29 56.728571 71.378
2017-02-05 66.128571 107.829
... ... ...
2019-12-08 74.371429 167.481
2019-12-15 61.242857 86.248
2019-12-22 50.000000 73.206
2019-12-29 60.128571 35.655
2020-01-05 7.185714 4.947

157 rows × 2 columns

df_elec.index
DatetimeIndex(['2017-01-08', '2017-01-15', '2017-01-22', '2017-01-29',
               '2017-02-05', '2017-02-12', '2017-02-19', '2017-02-26',
               '2017-03-05', '2017-03-12',
               ...
               '2019-11-03', '2019-11-10', '2019-11-17', '2019-11-24',
               '2019-12-01', '2019-12-08', '2019-12-15', '2019-12-22',
               '2019-12-29', '2020-01-05'],
              dtype='datetime64[ns]', name='Date', length=157, freq='W-SUN')
df_elec['kwh'][:60].plot()
<Axes: xlabel='Date'>

png


plt.rc("figure", figsize=(18,4))
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
plot_acf(df_elec['kwh']);
plot_pacf(df_elec['kwh']);

png

png

result = seasonal_decompose(df_elec['kwh'], model = 'additive') 
plt.rcParams['figure.figsize'] = (20, 9)
result.plot();

png

# Plot the months to see trends over months
month_plot(df_elec[['kwh']].resample(rule='M').kwh.sum());

png

# Plot the quarter to see trends over quarters
quarter_plot(df_elec[['kwh']].resample(rule='Q').kwh.sum());

png

Train-test split

# Train-test split
test_samples = 12

train_set = df_elec.iloc[:-test_samples]
test_set = df_elec.iloc[-test_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  145
Test set:  12
train_set
Temp_avg kwh
Date
2017-01-08 75.542857 106.5490
2017-01-15 71.014286 129.0960
2017-01-22 64.414286 68.7700
2017-01-29 56.728571 71.3780
2017-02-05 66.128571 107.8290
... ... ...
2019-09-15 76.800000 168.3470
2019-09-22 79.385714 157.7260
2019-09-29 80.928571 191.8260
2019-10-06 70.771429 137.1698
2019-10-13 74.285714 147.5200

145 rows × 2 columns

test_set
Temp_avg kwh
Date
2019-10-20 73.471429 137.847
2019-10-27 64.557143 84.999
2019-11-03 62.928571 76.744
2019-11-10 77.985714 175.708
2019-11-17 49.042857 77.927
2019-11-24 62.785714 62.805
2019-12-01 69.557143 74.079
2019-12-08 74.371429 167.481
2019-12-15 61.242857 86.248
2019-12-22 50.000000 73.206
2019-12-29 60.128571 35.655
2020-01-05 7.185714 4.947

Determine parameters for SARIMAX using Auto-ARIMA

Uncomment and run the cell below to perform the auto_arima search yourself (it is very time consuming with m=52); the resulting order and seasonal order are hard-coded in the cell that follows.

model = auto_arima(train_set['kwh'],seasonal=True,m=52)  
order = model.get_params()['order']  
seasonal_order = model.get_params()['seasonal_order']  

print('Order = ', order)  
print('Seasonal Order = ', seasonal_order)  

# Set the order, i.e., the values of p, d, q and P, D, Q, m.

order =  (1, 0, 1)
seasonal_order =  (1, 0, 1, 52)

First, let us try SARIMA, ignoring temperature

# Create and fit model

from statsmodels.tsa.statespace.sarimax import SARIMAX
model_SARIMA = SARIMAX(train_set['kwh'],order=order,seasonal_order=seasonal_order)
model_SARIMA = model_SARIMA.fit()
# model_SARIMA.summary()
model_SARIMA.params
ar.L1         0.983883
ma.L1        -0.703757
ar.S.L52      0.994796
ma.S.L52     -0.842830
sigma2      781.564879
dtype: float64
# Create SARIMA predictions

start=len(train_set)
end=len(train_set)+len(test_set)-1
SARIMApredictions = model_SARIMA.predict(start=start, end=end, dynamic=False, typ='levels').rename('SARIMA Predictions')
# Calculate Evaluation Metrics

y_test = test_set['kwh']
y_pred = SARIMApredictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2019-10-20 137.847 108.432110 29.414890
2019-10-27 84.999 84.006054 0.992946
2019-11-03 76.744 99.242296 -22.498296
2019-11-10 175.708 164.413570 11.294430
2019-11-17 77.927 92.077521 -14.150521
2019-11-24 62.805 74.425502 -11.620502
2019-12-01 74.079 81.382111 -7.303111
2019-12-08 167.481 164.427999 3.053001
2019-12-15 86.248 95.116618 -8.868618
2019-12-22 73.206 65.763267 7.442733
2019-12-29 35.655 81.148805 -45.493805
2020-01-05 4.947 113.466587 -108.519587
# Model evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  1323.1768701198491
RMSE =  36.375498211293944
MAE =  22.55437004082181
# Plot results

train_set['kwh'].rename('Training Set').plot(legend=True)
test_set['kwh'].rename('Test Set').plot(legend=True)
SARIMApredictions.plot(legend = True)
plt.show()

png

y_test
Date
2019-10-20    137.847
2019-10-27     84.999
2019-11-03     76.744
2019-11-10    175.708
2019-11-17     77.927
2019-11-24     62.805
2019-12-01     74.079
2019-12-08    167.481
2019-12-15     86.248
2019-12-22     73.206
2019-12-29     35.655
2020-01-05      4.947
Freq: W-SUN, Name: kwh, dtype: float64

Let us use SARIMAX - Seasonal ARIMA with eXogenous Variable

model = SARIMAX(endog=train_set['kwh'],exog=train_set['Temp_avg'],order=(1,0,0),seasonal_order=(2,0,0,7),enforce_invertibility=False)
results = model.fit()
results.summary()
SARIMAX Results
Dep. Variable: kwh No. Observations: 145
Model: SARIMAX(1, 0, 0)x(2, 0, 0, 7) Log Likelihood -732.174
Date: Fri, 10 Nov 2023 AIC 1474.348
Time: 22:42:32 BIC 1489.232
Sample: 01-08-2017 HQIC 1480.396
- 10-13-2019
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
Temp_avg 2.1916 0.096 22.781 0.000 2.003 2.380
ar.L1 0.6524 0.064 10.202 0.000 0.527 0.778
ar.S.L7 -0.2007 0.089 -2.261 0.024 -0.375 -0.027
ar.S.L14 -0.0945 0.090 -1.048 0.294 -0.271 0.082
sigma2 1414.9974 193.794 7.302 0.000 1035.169 1794.826
Ljung-Box (L1) (Q): 0.20 Jarque-Bera (JB): 2.42
Prob(Q): 0.65 Prob(JB): 0.30
Heteroskedasticity (H): 1.42 Skew: 0.30
Prob(H) (two-sided): 0.23 Kurtosis: 2.80

Warnings:[1] Covariance matrix calculated using the outer product of gradients (complex-step).

# Create and fit model

from statsmodels.tsa.statespace.sarimax import SARIMAX
model_SARIMAX = SARIMAX(train_set['kwh'], exog = train_set['Temp_avg'], order=order,seasonal_order=seasonal_order)
model_SARIMAX = model_SARIMAX.fit()
# model_SARIMAX.summary()
model_SARIMAX.params
Temp_avg      4.494951
ar.L1         0.994501
ma.L1        -0.672744
ar.S.L52      0.996258
ma.S.L52     -0.936537
sigma2      680.303411
dtype: float64

# Create SARIMAX predictions

exog = test_set[['Temp_avg']]
start=len(train_set)
end=len(train_set)+len(test_set)-1
SARIMAXpredictions = model_SARIMAX.predict(start=start, end=end, exog = exog, dynamic=False, typ='levels').rename('SARIMAX Predictions')
# Calculate Evaluation Metrics 

y_test = test_set['kwh']
y_pred = SARIMAXpredictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})




y_test y_pred diff
2019-10-20 137.847 135.762630 2.084370
2019-10-27 84.999 90.884021 -5.885021
2019-11-03 76.744 81.577953 -4.833953
2019-11-10 175.708 172.860135 2.847865
2019-11-17 77.927 35.830430 42.096570
2019-11-24 62.805 89.152707 -26.347707
2019-12-01 74.079 120.706976 -46.627976
2019-12-08 167.481 150.329857 17.151143
2019-12-15 86.248 100.167290 -13.919290
2019-12-22 73.206 22.178695 51.027305
2019-12-29 35.655 96.057753 -60.402753
2020-01-05 4.947 -156.713632 161.660632

# Metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  3132.1077834193397
RMSE =  55.96523727653926
MAE =  36.240382161713455
# Plot results

train_set['kwh'].rename('Training Set').plot(legend=True)
test_set['kwh'].rename('Test Set').plot(legend=True)
SARIMAXpredictions.plot(legend = True)
plt.show()

png

y_test
Date
2019-10-20    137.847
2019-10-27     84.999
2019-11-03     76.744
2019-11-10    175.708
2019-11-17     77.927
2019-11-24     62.805
2019-12-01     74.079
2019-12-08    167.481
2019-12-15     86.248
2019-12-22     73.206
2019-12-29     35.655
2020-01-05      4.947
Freq: W-SUN, Name: kwh, dtype: float64

FB Prophet

https://facebook.github.io/prophet/

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.

It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

An advantage of Prophet over many other methods is that it can detect and account for multiple layers of seasonality in the same dataset. (What does this mean? Data may have hourly, weekly and monthly seasonality, all distinct from each other. Many other methods estimate seasonality only at the granularity of the data, i.e., to get monthly seasonality you need to provide monthly data by resampling.)
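
As a minimal sketch of what this looks like in code (df_hourly here is an assumed dataframe with Prophet's required 'ds' and 'y' columns, like the df2 we build later in this notebook), Prophet fits its standard components automatically and lets you add further seasonal layers explicitly:

from prophet import Prophet

m = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True)
m.add_seasonality(name='monthly', period=30.5, fourier_order=5)  # an extra, explicit layer
# m.fit(df_hourly)  # df_hourly is assumed; all layers are estimated from the same data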

# Python
import pandas as pd
from prophet import Prophet

# Train-test split
train_samples = int(new_df.shape[0] * 0.8)

train_set = new_df.iloc[:train_samples]
test_set = new_df.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  60
Test set:  15
train_set.head()
Total Moving_Avg_6m Moving_Avg_12m EWMA12
Date
2013-09-30 80729.0 97184.000000 74734.583333 82750.448591
2013-10-31 81352.0 98743.000000 76039.333333 82535.302654
2013-11-30 59270.0 90525.666667 76757.916667 78956.025322
2013-12-31 43553.0 81237.833333 77356.583333 73509.406042
2014-01-31 59873.0 71554.333333 78605.666667 71411.497420
# Create the ds and y columns for Prophet
train_set_prophet = train_set.reset_index()
train_set_prophet = train_set_prophet[['Date', 'Total']]
train_set_prophet.columns = ['ds', 'y']
train_set_prophet.head()
ds y
0 2013-09-30 80729.0
1 2013-10-31 81352.0
2 2013-11-30 59270.0
3 2013-12-31 43553.0
4 2014-01-31 59873.0
model = Prophet()
model.fit(train_set_prophet)
22:42:37 - cmdstanpy - INFO - Chain [1] start processing
22:42:37 - cmdstanpy - INFO - Chain [1] done processing





<prophet.forecaster.Prophet at 0x14408a54c10>
future = model.make_future_dataframe(periods=15,freq = 'm')
future.tail()
ds
70 2019-07-31
71 2019-08-31
72 2019-09-30
73 2019-10-31
74 2019-11-30
future.shape
(75, 1)
# Python
forecast = model.predict(future)
forecast.columns
Index(['ds', 'trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
       'additive_terms', 'additive_terms_lower', 'additive_terms_upper',
       'yearly', 'yearly_lower', 'yearly_upper', 'multiplicative_terms',
       'multiplicative_terms_lower', 'multiplicative_terms_upper', 'yhat'],
      dtype='object')
forecast.head()
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
0 2013-09-30 83323.589562 85819.953093 99008.026474 83323.589562 83323.589562 9120.391098 9120.391098 9120.391098 9120.391098 9120.391098 9120.391098 0.0 0.0 0.0 92443.980660
1 2013-10-31 83286.808660 73361.637478 86911.006692 83286.808660 83286.808660 -2989.685567 -2989.685567 -2989.685567 -2989.685567 -2989.685567 -2989.685567 0.0 0.0 0.0 80297.123093
2 2013-11-30 83251.214239 53784.220687 66821.426791 83251.214239 83251.214239 -23011.327489 -23011.327489 -23011.327489 -23011.327489 -23011.327489 -23011.327489 0.0 0.0 0.0 60239.886750
3 2013-12-31 83214.433337 36997.798701 50727.126628 83214.433337 83214.433337 -39468.043632 -39468.043632 -39468.043632 -39468.043632 -39468.043632 -39468.043632 0.0 0.0 0.0 43746.389705
4 2014-01-31 83177.652434 49717.718616 62756.802402 83177.652434 83177.652434 -26943.934360 -26943.934360 -26943.934360 -26943.934360 -26943.934360 -26943.934360 0.0 0.0 0.0 56233.718075
preds = pd.DataFrame({'Prediction': forecast.yhat[-15:]})
preds.index = pd.to_datetime(forecast.ds[-15:])
preds.index.names = ['Date']
preds
Prediction
Date
2018-09-30 96887.285693
2018-10-31 88140.126499
2018-11-30 62980.805485
2018-12-31 50953.542730
2019-01-31 62367.352742
2019-02-28 56888.480539
2019-03-31 76208.358529
2019-04-30 86669.010005
2019-05-31 122794.851343
2019-06-30 120372.674873
2019-07-31 128458.704108
2019-08-31 114207.558640
2019-09-30 101186.010180
2019-10-31 95354.181624
2019-11-30 64717.364105
# Calculate Evaluation Metrics

y_test = test_set['Total'] 
y_pred = preds['Prediction']
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
Date
2018-09-30 96242.0 96887.285693 -645.285693
2018-10-31 90982.0 88140.126499 2841.873501
2018-11-30 68431.0 62980.805485 5450.194515
2018-12-31 46941.0 50953.542730 -4012.542730
2019-01-31 72883.0 62367.352742 10515.647258
2019-02-28 36099.0 56888.480539 -20789.480539
2019-03-31 85457.0 76208.358529 9248.641471
2019-04-30 87932.0 86669.010005 1262.989995
2019-05-31 129123.0 122794.851343 6328.148657
2019-06-30 132512.0 120372.674873 12139.325127
2019-07-31 137714.0 128458.704108 9255.295892
2019-08-31 142414.0 114207.558640 28206.441360
2019-09-30 112174.0 101186.010180 10987.989820
2019-10-31 104498.0 95354.181624 9143.818376
2019-11-30 84963.0 64717.364105 20245.635895
# Model evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  157807682.19504696
RMSE =  12562.15276913344
MAE =  10071.55405527476
# Plot results

train_set['Total'].rename('Training Set').plot(legend=True)
test_set['Total'].rename('Test Set').plot(legend=True)
preds.Prediction.plot(legend = True)
plt.show()

png

model.plot_components(forecast);

png

Modeling Hourly Data with Prophet

df = pd.read_csv('https://data.seattle.gov/api/views/65db-xm6k/rows.csv')
df
Date Fremont Bridge Sidewalks, south of N 34th St Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk
0 08/01/2022 12:00:00 AM 23.0 7.0 16.0
1 08/01/2022 01:00:00 AM 12.0 5.0 7.0
2 08/01/2022 02:00:00 AM 3.0 0.0 3.0
3 08/01/2022 03:00:00 AM 5.0 2.0 3.0
4 08/01/2022 04:00:00 AM 10.0 2.0 8.0
... ... ... ... ...
95635 08/31/2023 07:00:00 PM 224.0 72.0 152.0
95636 08/31/2023 08:00:00 PM 142.0 59.0 83.0
95637 08/31/2023 09:00:00 PM 67.0 35.0 32.0
95638 08/31/2023 10:00:00 PM 43.0 18.0 25.0
95639 08/31/2023 11:00:00 PM 12.0 8.0 4.0

95640 rows × 4 columns

# Rename the columns to make them simpler to use
df.columns = ['Date', 'Total', 'East', 'West']
# Convert the Date column to datetime
df['Date'] = pd.DatetimeIndex(df.Date)
df.head()
Date Total East West
0 2022-08-01 00:00:00 23.0 7.0 16.0
1 2022-08-01 01:00:00 12.0 5.0 7.0
2 2022-08-01 02:00:00 3.0 0.0 3.0
3 2022-08-01 03:00:00 5.0 2.0 3.0
4 2022-08-01 04:00:00 10.0 2.0 8.0
# Create the ds and y columns for Prophet

df2 = df[['Date', 'Total']]
df2.columns = ['ds', 'y']
df2
ds y
0 2022-08-01 00:00:00 23.0
1 2022-08-01 01:00:00 12.0
2 2022-08-01 02:00:00 3.0
3 2022-08-01 03:00:00 5.0
4 2022-08-01 04:00:00 10.0
... ... ...
95635 2023-08-31 19:00:00 224.0
95636 2023-08-31 20:00:00 142.0
95637 2023-08-31 21:00:00 67.0
95638 2023-08-31 22:00:00 43.0
95639 2023-08-31 23:00:00 12.0

95640 rows × 2 columns

# Keep only pre-Covid data (drop everything from 2019-12-31 onward)

df_precovid = df2[df2.ds < pd.to_datetime('2019-12-31')]
%%time
model = Prophet()
model.fit(df_precovid)
22:42:53 - cmdstanpy - INFO - Chain [1] start processing
22:43:07 - cmdstanpy - INFO - Chain [1] done processing


CPU times: total: 1.7 s
Wall time: 19.3 s





<prophet.forecaster.Prophet at 0x144193be390>
future = model.make_future_dataframe(periods=365)
future.tail()
ds
63829 2020-12-25 23:00:00
63830 2020-12-26 23:00:00
63831 2020-12-27 23:00:00
63832 2020-12-28 23:00:00
63833 2020-12-29 23:00:00
type(future)
pandas.core.frame.DataFrame
future.shape
(63834, 1)
%%time
forecast = model.predict(future)
CPU times: total: 2.3 s
Wall time: 8.74 s
type(forecast)
pandas.core.frame.DataFrame
forecast.shape
(63834, 22)
forecast.columns
Index(['ds', 'trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
       'additive_terms', 'additive_terms_lower', 'additive_terms_upper',
       'daily', 'daily_lower', 'daily_upper', 'weekly', 'weekly_lower',
       'weekly_upper', 'yearly', 'yearly_lower', 'yearly_upper',
       'multiplicative_terms', 'multiplicative_terms_lower',
       'multiplicative_terms_upper', 'yhat'],
      dtype='object')
# forecast.to_excel('forecast.xlsx')
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
ds yhat yhat_lower yhat_upper
63829 2020-12-25 23:00:00 -16.020172 -142.659821 105.518723
63830 2020-12-26 23:00:00 -67.037196 -191.451418 42.315426
63831 2020-12-27 23:00:00 -22.832529 -145.837458 87.747590
63832 2020-12-28 23:00:00 27.544487 -85.548906 145.626637
63833 2020-12-29 23:00:00 26.487504 -87.689924 157.621258
# Python
fig1 = model.plot(forecast, include_legend=True)

png

# Python
fig2 = model.plot_components(forecast)

png


Deep Learning - Timeseries prediction using RNNs

Prediction with a deep neural network:
1. Split the data into train and test.
2. Decide how many time periods (the lag) to use for each prediction, and divide the data into input sequences and targets using TimeseriesGenerator from TensorFlow (see the small sketch after this list).
3. Create a model and predict the first observation.
4. Then predict subsequent observations, feeding each earlier prediction back into the input sequence.
5. Repeat to get as many predictions as you need (typically equal to the test set size). Evaluate the model.
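
Here is a small sketch, on a toy series independent of our data, of what step 2 produces: sliding windows of length lag, each paired with the value that follows it. TimeseriesGenerator (used below) builds exactly these pairs, batch by batch.

import numpy as np

series = np.arange(10, dtype=float)   # 0, 1, ..., 9
lag = 3
X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
y = series[lag:]
print(X[0], '->', y[0])   # [0. 1. 2.] -> 3.0
print(len(X))             # 10 - 3 = 7 windows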

# We use the same train and test sets that were created in the previous example
# Let us look at the dates in the train_set.  

train_set.index
DatetimeIndex(['2013-09-30', '2013-10-31', '2013-11-30', '2013-12-31',
               '2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30',
               '2014-05-31', '2014-06-30', '2014-07-31', '2014-08-31',
               '2014-09-30', '2014-10-31', '2014-11-30', '2014-12-31',
               '2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30',
               '2015-05-31', '2015-06-30', '2015-07-31', '2015-08-31',
               '2015-09-30', '2015-10-31', '2015-11-30', '2015-12-31',
               '2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30',
               '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31',
               '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31',
               '2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
               '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
               '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31',
               '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31'],
              dtype='datetime64[ns]', name='Date', freq='M')
train_set = train_set[['Total']]
test_set = test_set[['Total']]
# We will do standard scaling as we are using a neural net

from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
# Ignore any warning here; it is just converting the values to floats.
# We fit the scaler only on the training data; fitting on the test set would leak information about it.

std_scaler.fit(train_set)
StandardScaler()
# Now we transform using the standard scaler instantiated above

std_train = std_scaler.transform(train_set)
std_test = std_scaler.transform(test_set)
std_train.shape
(60, 1)
std_test.shape
(15, 1)

Create Sequences

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

# define generator
lag = 12
batch_size = 1
sequences = TimeseriesGenerator(std_train, std_train, length=lag, batch_size=batch_size)
# Let us see what our standard scaled training data looks like.

std_train
array([[-0.06864305],
       [-0.0450607 ],
       [-0.88092804],
       [-1.47586179],
       [-0.85810276],
       [-1.34443658],
       [-0.72103747],
       [ 0.16324371],
       [ 1.36654898],
       [ 1.07368123],
       [ 1.44320106],
       [ 1.13360234],
       [ 0.56838311],
       [ 0.02428578],
       [-0.96723261],
       [-1.28833861],
       [-0.82944812],
       [-0.90405615],
       [-0.43146292],
       [ 0.04370431],
       [ 0.955126  ],
       [ 1.18004783],
       [ 1.14457968],
       [ 0.78766485],
       [ 0.32544331],
       [ 0.01743441],
       [-0.97942124],
       [-1.45924438],
       [-1.16622522],
       [-0.83887349],
       [-0.48218578],
       [ 0.42003766],
       [ 1.1967788 ],
       [ 0.94914525],
       [ 0.87593777],
       [ 1.12943852],
       [ 0.43964545],
       [-0.47919541],
       [-0.69821218],
       [-1.6505907 ],
       [-1.23920557],
       [-1.53460946],
       [-0.9007251 ],
       [-0.53483914],
       [ 1.00486469],
       [ 0.95611018],
       [ 1.37639073],
       [ 1.42499383],
       [ 0.52825905],
       [ 0.21199822],
       [-0.94096271],
       [-1.38845949],
       [-0.90663015],
       [-1.20619786],
       [-0.19904623],
       [-0.098244  ],
       [ 1.78932782],
       [ 1.15839598],
       [ 1.72138189],
       [ 1.10782453]])
len(std_train)
60
len(sequences)  # 60 training points minus a lag of 12 leaves 48 sequences
48
# What does the first batch look like?
X,y = sequences[0]
X
array([[[-0.06864305],
        [-0.0450607 ],
        [-0.88092804],
        [-1.47586179],
        [-0.85810276],
        [-1.34443658],
        [-0.72103747],
        [ 0.16324371],
        [ 1.36654898],
        [ 1.07368123],
        [ 1.44320106],
        [ 1.13360234]]])
y
array([[0.56838311]])
# What does the second batch look like?
X,y = sequences[1]
X
array([[[-0.0450607 ],
        [-0.88092804],
        [-1.47586179],
        [-0.85810276],
        [-1.34443658],
        [-0.72103747],
        [ 0.16324371],
        [ 1.36654898],
        [ 1.07368123],
        [ 1.44320106],
        [ 1.13360234],
        [ 0.56838311]]])
y
array([[0.02428578]])

Create the Model

# First, some library imports

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
import tensorflow as tf
# We will go with a very simple architecture, a sequential model 
# with just one hidden LSTM layer. Define model:  

model = Sequential()
model.add(LSTM(100, input_shape=(lag, 1)))  # (timesteps, features): 12 lags, 1 feature per step
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 lstm (LSTM)                 (None, 100)               40800

 dense (Dense)               (None, 1)                 101

=================================================================
Total params: 40901 (159.77 KB)
Trainable params: 40901 (159.77 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
# fit model

callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=4)
history = model.fit(sequences, epochs=50, callbacks=[callback])  # fit() accepts the generator directly; fit_generator is deprecated
Epoch 1/50
48/48 [==============================] - 2s 7ms/step - loss: 0.5581
Epoch 2/50
48/48 [==============================] - 0s 7ms/step - loss: 0.2086
Epoch 3/50
48/48 [==============================] - 0s 5ms/step - loss: 0.2686
Epoch 4/50
48/48 [==============================] - 0s 5ms/step - loss: 0.1574
Epoch 5/50
48/48 [==============================] - 0s 4ms/step - loss: 0.1408
Epoch 6/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1744
Epoch 7/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1421
Epoch 8/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1696
Epoch 9/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1318
Epoch 10/50
48/48 [==============================] - 0s 4ms/step - loss: 0.1321
Epoch 11/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1318
Epoch 12/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1423
Epoch 13/50
48/48 [==============================] - 0s 5ms/step - loss: 0.1208
Epoch 14/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1472
Epoch 15/50
48/48 [==============================] - 0s 4ms/step - loss: 0.1156
Epoch 16/50
48/48 [==============================] - 0s 5ms/step - loss: 0.1262
Epoch 17/50
48/48 [==============================] - 0s 5ms/step - loss: 0.1403
Epoch 18/50
48/48 [==============================] - 0s 6ms/step - loss: 0.1193
Epoch 19/50
48/48 [==============================] - 0s 7ms/step - loss: 0.1363
loss_per_epoch = model.history.history['loss']
plt.plot(range(len(loss_per_epoch)),loss_per_epoch)
[<matplotlib.lines.Line2D at 0x14490d21310>]

png

Evaluate on Test Data

As part of our training, we used the preceding 12 values to predict the next one. Now we will use the last 12 values of the training data to predict the first value of the test data. Then we will add this predicted value to our sequence, and predict the next one, and so on.

lag
12
# Pick the last 12 values in the training data

predictors = std_train[-lag:]
# See what they look like...

predictors
array([[ 0.52825905],
       [ 0.21199822],
       [-0.94096271],
       [-1.38845949],
       [-0.90663015],
       [-1.20619786],
       [-0.19904623],
       [-0.098244  ],
       [ 1.78932782],
       [ 1.15839598],
       [ 1.72138189],
       [ 1.10782453]])

We need to reshape this data for the LSTM layer, which expects a 3-dimensional input of shape (batch, timesteps, features). We do that next.

predictors = predictors.reshape((1, lag, 1))

Next, we perform the prediction to get the first item after the training data has ended. This prediction will come in as standard-scaled, so we will need to reverse out the scaling to get the true prediction.

# Predict 

x = model.predict(predictors)
x
1/1 [==============================] - 0s 414ms/step





array([[0.45122966]], dtype=float32)
# Do an inverse transform to get the actual prediction

std_scaler.inverse_transform(x)
array([[94463.03]], dtype=float32)
# Let us see what the actual value in the test set was.

test_set.iloc[0]
Total    96242.0
Name: 2018-09-30 00:00:00, dtype: float64

Now we can predict the next data point. However, doing this manually is going to be very tedious, so we write some code to loop through this and do it for as many times as we need the predictions for.

predictors = std_train[-lag:].reshape((1, lag, 1))
predictions = []
for i in range(len(std_test)):
    next_pred = model.predict(predictors)[0]
    predictions.append(next_pred) 
    predictors = np.append(predictors[:,1:,:],[[next_pred]],axis=1)

predictions
1/1 [==============================] - 0s 22ms/step
1/1 [==============================] - 0s 27ms/step
1/1 [==============================] - 0s 30ms/step
1/1 [==============================] - 0s 26ms/step
1/1 [==============================] - 0s 35ms/step
1/1 [==============================] - 0s 27ms/step
1/1 [==============================] - 0s 32ms/step
1/1 [==============================] - 0s 29ms/step
1/1 [==============================] - 0s 30ms/step
1/1 [==============================] - 0s 28ms/step
1/1 [==============================] - 0s 33ms/step
1/1 [==============================] - 0s 21ms/step
1/1 [==============================] - 0s 30ms/step
1/1 [==============================] - 0s 31ms/step
1/1 [==============================] - 0s 25ms/step





[array([0.45122966], dtype=float32),
 array([-0.3290154], dtype=float32),
 array([-1.0449202], dtype=float32),
 array([-1.4312159], dtype=float32),
 array([-1.4465744], dtype=float32),
 array([-1.1398427], dtype=float32),
 array([-0.48234788], dtype=float32),
 array([0.2781088], dtype=float32),
 array([1.047395], dtype=float32),
 array([1.4374696], dtype=float32),
 array([1.5440987], dtype=float32),
 array([1.1877004], dtype=float32),
 array([0.53592247], dtype=float32),
 array([-0.24500774], dtype=float32),
 array([-0.94227564], dtype=float32)]
# We inverse transform these scaled predictions

final_predictions = std_scaler.inverse_transform(predictions)
final_predictions
array([[ 94463.03240163],
       [ 73850.4654616 ],
       [ 54937.64395903],
       [ 44732.45864597],
       [ 44326.71497745],
       [ 52429.97380651],
       [ 69799.71784522],
       [ 89889.51390621],
       [110212.56840795],
       [120517.58586778],
       [123334.52152427],
       [113919.16595602],
       [ 96700.45268135],
       [ 76069.78565349],
       [ 57649.31496236]])
# Now let us plot the actuals versus the predicted values

pd.DataFrame({'actual': test_set.Total, 'forecast': final_predictions.flatten()}).plot()
<Axes: xlabel='Date'>

png

# Next, we evaluate our model using some metrics

y_test = test_set.Total
y_pred = final_predictions.flatten()

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  341904917.8754553
RMSE =  18490.671103977144
MAE =  16140.603957905736

Which method performed best?

Here is a summary of the model performance above. The exact numbers will vary from run to run, but they give a reasonable picture of how the methods compare (a small sketch collecting them into one table follows the list).

EWMA
MSE = 1228535508.79919
RMSE = 35050.47087842316
MAE = 29263.82507577501

DES
MSE = 6789777046.197141
RMSE = 82400.10343559734
MAE = 75161.63066121914

TES
MSE = 160014279.71049073
RMSE = 12649.675083198412
MAE = 10899.78971069559

ARIMA no seasonality
MSE = 385065540.80112064
RMSE = 19623.086933536237
MAE = 17158.523051864664

SARIMA
MSE = 231993016.8844547
RMSE = 15231.316978004716
MAE = 12576.568673608848

Prophet
MSE = 157807679.90652037
RMSE = 12562.152678045286
MAE = 10071.553953152606

Deep Learning
MSE = 449731703.1802
RMSE = 21206.87867603811
MAE = 18007.781001223593
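
For convenience, here is a small sketch (using the rounded figures recorded above) that collects the RMSE and MAE numbers into a single table:

import pandas as pd

summary = pd.DataFrame({
    'RMSE': [35050.47, 82400.10, 12649.68, 19623.09, 15231.32, 12562.15, 21206.88],
    'MAE':  [29263.83, 75161.63, 10899.79, 17158.52, 12576.57, 10071.55, 18007.78],
}, index=['EWMA', 'DES', 'TES', 'ARIMA', 'SARIMA', 'Prophet', 'Deep Learning'])
print(summary.sort_values('RMSE'))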



Experiment - Can we predict the S&P 500?

# Let us get some data.  We download the daily time series for the S&P500 for 30 months

import yfinance as yf
SPY = yf.download('SPY', start = '2013-01-01', end = '2015-06-30')
[*********************100%%**********************]  1 of 1 completed
# Clean up

SPY.index = pd.DatetimeIndex(SPY.index) # Set index
SPY = SPY.asfreq('B') # This creates rows for any missing dates
SPY.fillna(method = 'bfill', inplace=True) # Backfills rows created for missing dates with the next available observation
SPY.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 649 entries, 2013-01-02 to 2015-06-29
Freq: B
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       649 non-null    float64
 1   High       649 non-null    float64
 2   Low        649 non-null    float64
 3   Close      649 non-null    float64
 4   Adj Close  649 non-null    float64
 5   Volume     649 non-null    float64
dtypes: float64(6)
memory usage: 35.5 KB
SPY['Returns'] = (SPY['Close'] / SPY['Close'].shift(1)) - 1  # Daily return: today's close over yesterday's close, minus 1
SPY[['Returns']].plot(figsize = (22,6));

png

SPY[['Close']].plot(figsize = (22,6));

png

# Train-test split
train_samples = int(SPY.shape[0] * 0.8)

train_set = SPY.iloc[:train_samples]
test_set = SPY.iloc[train_samples:]

print("Training set: ", train_set.shape[0])
print("Test set: ", test_set.shape[0])
Training set:  519
Test set:  130
model = auto_arima(train_set['Close'], seasonal=True , m = 12)
order = model.get_params()['order']
seasonal_order = model.get_params()['seasonal_order']

print('Order = ', order)
print('Seasonal Order = ', seasonal_order)
Order =  (0, 1, 0)
Seasonal Order =  (0, 0, 0, 12)
model = ARIMA(train_set['Close'],order = order)
results = model.fit()
results.summary()

start=len(train_set)
end=len(train_set)+len(test_set)-1
predictions = results.predict(start=start, end=end, dynamic=False, typ='levels').rename('ARIMA Predictions')

train_set['Close'].rename('Training Set').plot(legend=True)
test_set['Close'].rename('Test Set').plot(legend=True)
predictions.plot(legend = True)
plt.show()

png

# Calculate Evaluation Metrics
y_test = test_set['Close']
y_pred = predictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2014-12-30 207.600006 208.720001 -1.119995
2014-12-31 205.539993 208.720001 -3.180008
2015-01-01 205.429993 208.720001 -3.290009
2015-01-02 205.429993 208.720001 -3.290009
2015-01-05 201.720001 208.720001 -7.000000
... ... ... ...
2015-06-23 212.039993 208.720001 3.319992
2015-06-24 210.500000 208.720001 1.779999
2015-06-25 209.860001 208.720001 1.139999
2015-06-26 209.820007 208.720001 1.100006
2015-06-29 205.419998 208.720001 -3.300003

130 rows × 3 columns

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  11.971797943699102
RMSE =  3.460028604462556
MAE =  2.769538996769832
plt.rcParams['figure.figsize'] = (20, 9)
# Create and fit model
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(train_set['Close'],order=order,seasonal_order=seasonal_order)
results = model.fit()


start=len(train_set)
end=len(train_set)+len(test_set)-1
predictions = results.predict(start=start, end=end, dynamic=False, typ='levels').rename('SARIMA Predictions')

train_set['Close'].rename('Training Set').plot(legend=True)
test_set['Close'].rename('Test Set').plot(legend=True)
predictions.plot(legend = True)
plt.show()

png

# Calculate Evaluation Metrics
y_test = test_set['Close']
y_pred = predictions
pd.DataFrame({'y_test': y_test, 'y_pred' : y_pred, 'diff':y_test - y_pred})
y_test y_pred diff
2014-12-30 207.600006 208.720001 -1.119995
2014-12-31 205.539993 208.720001 -3.180008
2015-01-01 205.429993 208.720001 -3.290009
2015-01-02 205.429993 208.720001 -3.290009
2015-01-05 201.720001 208.720001 -7.000000
... ... ... ...
2015-06-23 212.039993 208.720001 3.319992
2015-06-24 210.500000 208.720001 1.779999
2015-06-25 209.860001 208.720001 1.139999
2015-06-26 209.820007 208.720001 1.100006
2015-06-29 205.419998 208.720001 -3.300003

130 rows × 3 columns

from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MSE = ', mean_squared_error(y_test,y_pred))
print('RMSE = ', np.sqrt(mean_squared_error(y_test,y_pred)))
print('MAE = ', mean_absolute_error(y_test,y_pred))
MSE =  11.971797943699102
RMSE =  3.460028604462556
MAE =  2.769538996769832