A Crash Course in Time Series Decomposition and Forecasting with a Vector Autoregression (VAR) Model
I’ve dealt with projects that involved time series analysis and forecasting on-and-off for the past several years, but I’ve always found the topic somewhat inaccessible for beginners due to the lack of comprehensive Python tutorials available. So, in an effort to correct this, this post walks beginners through time series analysis, using monthly electricity price data for the state of TX available via the Energy Information Administration’s API. Specifically, this tutorial covers time series decomposition and vector autoregression (VAR) modelling to forecast electricity prices for the state of Texas.
First, a little background on the Energy Information Administration, or the EIA. The EIA is a branch in the US Department of Energy responsible for collecting and analyzing energy-related data, including oil and gas, coal, nuclear, electric, and renewables. Via its Open Data application programming interface (API), users can directly pull the EIA time series data into Python for analysis.
For more background on setting up EIA API access in Python, check out this tutorial: https://techrando.com/2019/06/26/how-to-use-the-energy-information-administration-eia-application-programming-interface-api-and-pull-live-data-directly-into-python-for-analysis/
In this analysis, we’re going to pull the time series for electricity prices for the state of Texas into Python for analysis, as shown below:
Snapshot of the time series data for electricity prices, pulled via the EIA API
First, let’s look at whether or not the monthly electricity data displays seasonality and a trend. To do this, we use the seasonal_decompose() function in the statsmodels.tsa.seasonal package. This function breaks down a time series into its core components: trend, seasonality, and random noise. The code and its outputs are displayed below:
Time Series Decomposition: Monthly Electricity Prices in TX
So what does the plot above mean? If we take a closer look at the seasonal component of the decomposition, we can validate a trend that logically makes sense: during the summer months in Texas, electricity prices skyrocket. For anyone who has been to Texas in the middle of summer, it’s sweltering hot and practically unlivable without air conditioning (I know, I live in Houston). So, more air conditioning leads to more electricity usage, which leads to higher electricity prices.
Additionally, there is a second, smaller spike in electricity prices around the beginning of each year. This is likely due to the weather being cool enough that Texans resort to heating their homes, offices, etc. As most of Texas’s weather is fairly mild in the winter, this spike is significantly less pronounced than the spike in electricity prices during the summer.
Now that we’ve done some initial analysis using time series decomposition, we want to determine if we can predict electricity prices by using another predictor time series as a proxy. In order to get an idea of what factors may affect electricity prices, let’s first check out how electricity is generated in Texas. On its website, the EIA provides a breakdown of the sources of electricity in Texas, shown in the bar graph below:
Image courtesy of https://www.eia.gov/state/?sid=TX#tabs-4
Based on the above graph, natural gas appears to be the main source of electricity generation in Texas as of March 2019, with coal-fired and renewables tied for second.
Although we’re analyzing electricity data going back to 2001 and the breakdown above is from March 2019, for the sake of simplicity we’ll assume that natural gas has been one of the main sources of electricity generation in Texas for the past 15–20 years. Consequently, we’re going to pull the time series of natural gas prices via the EIA API and compare side-by-side to the TX electricity price time series:
Plotted Natural Gas Prices and Electricity Prices over Time
There appears to be more variation in the natural gas price time series (the highs are especially high and the lows are especially low), but for the most part, the trends in both time series appear very similar. In fact, it’s highly likely we could estimate TX electricity price using natural gas price as a proxy.
This presents an interesting problem — using lagged values of a time series, along with lagged values other time series sets, to predict future values. One approach to solving a problem such as this is called vector autoregression, or VAR. VAR is an extension of the autoregressive (or AR) model, where multiple variables are used when generating predictions. VAR models are very popular because of their flexibility when analyzing economic and financial time series, and are great for forecasting.
The image below gives a breakdown of a VAR equation with two variables and first order dynamics. It is a very simple linear equation, where the x- and y-values at time t-1 are used to predict the x- and y-values at time t. As variable x depends on variable y and vice versa, they are jointly considered using a system of equations. Basically, VAR models work by using lagged data for multiple dependent time series to predict future values.
VAR equation for two variables and first-order dynamics: Image courtesy ofhttp://www.ams.sunysb.edu/~zhu/ams586/VAR_Lecture2.pdf
The following handout available via the University of Washington gives an in-depth description of the mathematics behind the VAR model, if you would like further background: https://faculty.washington.edu.
Additionally, the University of Chicago Booth School of Business gives a great high-level overview of VAR modeling in the following presentation: https://faculty.chicagobooth.edu/jeffrey.russell/teaching/timeseries/handouts/notes5.pdf
As with most econometric analysis, there are a couple of requirements for the VAR model to work. First, the passed time series need to be stationary, i.e. the statistical properties of the times series, including mean, variance, and autocorrelation, need to be constant over time. Consequently, time series that display seasonality or a trend are NOT stationary.
There are a couple approaches that we can take to make a model stationary. They are as follows:
- Difference the time series. Data differencing is, more or less, calculating the instantaneous velocity at each data point, or the amount the time series changes from one value to the next.
- Transform the time series. This can be done by applying a log or power transformation to the data.
Next, in order for a VAR model to work, the sampling frequency (i.e. daily, monthly, yearly data) needs to be the same. If it isn’t, the data needs to be converted to the same frequency using imputation/linear interpolation, etc.
In the code below, each time series is transformed using the numpy natural log function, and then differenced by one interval:
In order to test if the time series are now stationary, we use the Augmented Dickey Fuller Test. The Augmented Dickey-Fuller Test tests the null hypothesis that a unit root is present in a time series (i.e. if a unit root is in the time series, then it isn’t stationary). To successfully determine that a time series is stationary, the test must return a p-value of less than .05. In the Python code below, we run the Augmented Dickey-Fuller test on the transformed, differenced time series, determining that they are both stationary (the p-values returned are .000299 and 0 for the electricity and natural gas time series, respectively):
Outputs for the Augmented Dickey-Fuller Test for the electricity price time series and the natural gas price time series, respectively
Now that we’ve made our two time series stationary, it’s time to fit the data to a VAR model. The Python code below successfully builds the model and returns a summary of the results, where we use a 95/5 percent split for the training/validation sets:
VAR Model Summary, with y1=Electricity Price Time Series, and y2=Natural Gas Price Time Series
What we really want to focus on in the model summary above is the equation for y1, where y1 estimates the electricity prices in the state of Texas based on lagged values of itself and lagged natural gas prices.
There is both a t-statistic and p-value associated with the lagged electricity price data (L1.y1), and lagged natural gas price data (L1.y2). The t-statistic compares the data to expected results under the null hypothesis. If sample results exactly equal the null hypothesis, then the t-statistic is 0. Alternatively, the higher the t-statistic value, the more likely that we can reject the null hypothesis. Basically, the higher the t-statistic, the more likely that there is a correlation between the two variables.
Each t-statistic has a corresponding p-value. P-values are also used to reject the null hypothesis. The lower the p-value, the stronger the evidence against the null hypothesis. Normally a p-value under .05 is used to determine statistical significance.
For more information on t-statistics and p-values, check out the following link: https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics
With all of this information in mind, let’s interpret the results of the model summary for equation y1.
For the lagged electricity price variable, there is a t-statistic of 3.002 and a p-value 0.003 — clearly, lagged values of the electricity price time series affect future values of the time series, and we can reject the null hypothesis.
Lagged values of the natural gas price time series are also promising with a t-statistic of 2.225 and a p-value of 0.026, well below the cutoff for statistical significance, and a clear indicator that lagged values in the natural gas time series affect future values in the electricity price time series.
Now that we’ve built a model, it’s time to generate and compare the predictions to actual data in the test/validation set. It is important to note that the data must be un-differenced and back-transformed in order to understand the electricity price predictions. The following code performs the un-differencing and back-transformation operations, as well as calculating the forecasting metrics associated with evaluating model performance, including forecast bias, mean absolute error, mean squared error and root mean squared error:
Actual electricity price vs. VAR-model predicted electricity price, predicted out 10 months (after un-differencing and back-transformation) Accuracy metrics for the forecast: forecast bias, mean absolute error, mean squared error, and root mean square error
The results here look very promising! The VAR appears to estimate electricity prices fairly accurately. The forecast bias metric is -0.041638, indicating that the model has a tendency to over-forecast, or predict too high. Mean absolute error, mean squared error, and root mean squared error are also fairly low, which is good. Based onmean absolute error (MAE), there is an average $.22 variation between the forecasted and the actual values.
For more background on time series forecasting model metrics, check out the following link: https://machinelearningmastery.com/time-series-forecasting-performance-measures-with-python/
This concludes this tutorial. Thank you for reading!
The full script for this analysis is available via Github: https://github.com/kperry2215/electricity_price_time_series_analysis/blob/master/analyze_monthly_electricity_data.py
Check out some of my other energy-related Python tutorials:
Automating oil and gas decline curve analysis: https://techrando.com/2019/07/03/how-to-automate-decline-curve-analysis-dca-in-python-using-scipys-optimize-curve_fit-function/
Analyzing open source FracFocus data to gain completions insights: https://techrando.com/2019/07/14/using-publicly-available-fracfocus-data-and-pythons-matplotlib-function-to-visualize-oil-and-gas-companies-completions-strategies-in-the-permian-basin/