Forecasting with ARIMA

One of the most challenging machine learning problems is predicting an output based on the history of its previous values. The complexity of the problem multiplies as new features and constraints are added to the analysis: in time series analysis it is often not enough to use previous values alone, since there are frequently many other features that influence the output we want to predict. Numerous models and algorithms have been developed for this kind of problem. In this article, and in the series of articles that follows, several models, such as ARIMA, VAR and Facebook Prophet, will be introduced and finally compared, to see which one is best suited to which kinds of problems.

ARIMA stands for the autoregressive integrated moving average model, which produces forecasts based on the prior values of the time series (the AR terms) and the errors made by previous predictions (the MA terms). It can also work with non-stationary data, since it allows the data to be differenced first to eliminate the non-stationarity (the I term).

The PyFlux library offers a very good implementation of the ARIMA model in Python. It has useful documentation, complete with examples, and is very easy to use. To run the model, we need to determine the optimal value for each of the parameters it takes – the AR, I and MA terms.
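A minimal setup sketch follows; the file name and the 'output' column used throughout this post are our own placeholders, not fixed by PyFlux:

import pandas as pd
import pyflux as pf

# Hypothetical data source: a CSV with a date index and an 'output' column
working_data = pd.read_csv('data.csv', index_col=0, parse_dates=True)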

Problem definition

The data used for the analysis consist of 125 instances – historical values gathered over the period January–May 2017. The values lie mostly between 20 and 40, with a deviation of ±15 depending on the object being observed. The problem is to predict the output for a given interval, based on these historical data. This is what our data look like.

Data plotted

Three things that should be investigated before running a model are the stationarity, trend and seasonality of the time series, since each of them can negatively affect the results. Checking them tells us whether we should difference the data before training.

A rough conclusion that the data are not stationary can be reached simply by looking at the plot, but there is a slightly more elegant solution that lets us determine stationarity exactly – the Dickey-Fuller test, which has an implementation in the StatsModels library. Here's one example of how it can be used:
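(A sketch, assuming the series is stored in working_data['output'] – our naming.)

from statsmodels.tsa.stattools import adfuller

# Run the augmented Dickey-Fuller test on the raw series
result = adfuller(working_data['output'])
print('Test statistic: %f' % result[0])
print('p-value: %f' % result[1])
# Critical values at the 1%, 5% and 10% levels
for level, value in result[4].items():
    print('Critical value (%s): %f' % (level, value))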

These are the results of the DF test applied to our data.

DF stationarity test 1

The test statistic is higher than the critical values, which means we cannot reject the null hypothesis of the Dickey-Fuller test – that the series has a unit root – so we cannot treat the data as stationary.

The most popular technique for making data stationary and eliminating trend and seasonality is differencing.

The ARIMA function in the PyFlux library has a parameter that sets the order of differencing. In most cases, first-order differencing will do the job, but sometimes the data need to be differenced several times.
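For illustration, first-order differencing can also be done by hand with pandas before inspecting the series (a sketch using our working_data naming; PyFlux itself differences internally via the integ parameter shown later):

# First-order differencing: value(t) - value(t-1); the first element
# becomes NaN and is dropped
differenced = working_data['output'].diff().dropna()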

There are plenty of guidelines on how to determine the order of differencing, and there is an excellent post that summarizes the rules for identifying a proper ARIMA model and setting its parameters (https://people.duke.edu/~rnau/arimrule.htm).

So, our data now look like this.

Differenced data

After re-running the stationarity test on the differenced data, we see the following results.

DF stationarity test 2

The test statistic is now less than the critical value at the 5% significance level, so we can reject the null hypothesis of a unit root and treat the data as stationary.

This looks decent, so we can continue with the analysis and run the model.

Setting up parameters

The trickiest part of running an ARIMA model is determining the optimal number of AR and MA terms. Two additional functions can be used to solve this problem: the ACF (autocorrelation function) and the PACF (partial autocorrelation function), implementations of which can also be found in the StatsModels library. Both measure how data points in a time series are related across lags: the ACF measures the correlation between a point and its lagged values, while the PACF removes transitive dependencies between points (if a depends on b, and b depends on c, then a also depends on c through b, and the PACF filters that indirect effect out).
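Both come with plotting helpers in StatsModels. Here is a sketch, run on the differenced series; the lag window of 40 is an arbitrary choice of ours:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot the ACF and PACF with confidence bands on the differenced series
fig, (ax1, ax2) = plt.subplots(2, 1)
plot_acf(differenced, lags=40, ax=ax1)
plot_pacf(differenced, lags=40, ax=ax2)
plt.show()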

The autocorrelation function can be used to determine the optimal number of MA terms by the following rule: the lag at which the ACF chart crosses the upper confidence interval for the first time is the recommended value for the MA parameter. In this case, it is 29.

Autocorrelation function

The partial autocorrelation function is used in the same way to determine the optimal value for the AR parameter in the ARIMA function.

Partial autocorrelation function

The optimal value for AR parameter is 2.

Now that we have the optimal parameter values, we can run the model. One additional parameter that can be set is the distribution family of the time series – by default it is Normal, but other options include Student-t, Laplace, Cauchy, etc.

Running a model


model = pf.ARIMA(data=working_data, ar=2, ma=29, integ=1, target='output', family=pf.Normal())


The model is fitted with MLE – the maximum likelihood point mass estimate method, which estimates the model parameters by finding the particular parameter values that make the observed data most probable under the model.
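In PyFlux this is a single call (the result variable name x is ours):

# Estimate the model parameters by maximum likelihood and print the summary
x = model.fit("MLE")
x.summary()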

Here’s the summary.

ARIMA model summary

AIC and BIC are used as model performance measures. AIC is an acronym for the Akaike Information Criterion, a widely used score for statistical models: it combines the goodness of fit and the simplicity of the model into a single statistic. BIC is the Bayesian Information Criterion, a similar score that penalizes the number of parameters more strongly. The main rule for both is "the lower, the better", and the two can be used together to choose the best-fitting model among several candidates.
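For reference, with L̂ the maximized likelihood, k the number of estimated parameters and n the number of observations, the two criteria are:

AIC = 2k − 2·ln(L̂)
BIC = k·ln(n) − 2·ln(L̂)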

And this is what the results look like graphically.

ARIMA forecast results 1
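(For reference, plots like these can be produced with PyFlux's built-in helpers; the forecast horizon below is our own choice.)

# Plot the in-sample fit against the observed series
model.plot_fit()
# Plot a forecast for the next 10 steps with uncertainty intervals
model.plot_predict(h=10)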

Pretty powerful, right?

Well, at least we thought so, until we came across something like this.

ARIMA forecast results 2

We were looking for a more stable, accurate model and weren't satisfied with ARIMA and its RMSE of 3.56, so we dug deeper and found that the output we are trying to predict depends on many other features that should also be included in the modeling. That is when we turned to the VAR model, which allows us to introduce exogenous variables and train an autoregressive model that takes the values of these external features into account alongside the historical values of the output to be predicted. We will talk about it more in the next post.