Forecasting with ARIMA

Valentina Djordjevic Technology 9 Comments

One of the most challenging machine learning problems is predicting some output based on the history of its previous values. The complexity of the problem multiplies as new features and constraints are added to analysis. Thus, in time series analysis it is not always enough to use previous values only, there often are many features that could impact on the output which should be predicted. There are numerous models and algorithms developed for this kind of problem. In this article, and series of articles that follows, several models, such as ARIMA, VAR and  Facebook Prophet, will be introduced and finally compared to see which one is best suitable for different kinds of problems.

ARIMA stands for autoregressive integrated moving average model, which produces forecasts based upon prior values in the time series (AR terms) and the errors made by previous predictions (MA terms), with possibility to work with non-stationary data, since it allows us to initially differentiate data to eliminate the non-stationarity (I term).

PyFlux library offers a very good implementation of ARIMA model in Python. It has useful documentation, followed by examples and it is very easy to use. To run  this model, we need to determine the optimal value for each of the parameters this function takes – AR, MA and I terms.

Problem definition

Data used for analysis include 125 instances, historical values, gathered during time period January-May 2017. Data take values mostly between 20-40, with deviation of +-15, which depends on the object that is observed. The problem is to predict the output for a given interval, based on these historical data. This is how our data look like.

Data plotted

Three things that should be investigated before running a model are time series stationarity, trend and seasonality, since each of them could negatively affect results obtained. That will tell us if we should differentiate data before training.

A rough conclusion that data is not stationary could be made by simply looking at the plot, but there is a little more elegant solution, which allows us to exactly determine if data is stationary – a Dickey-Fuller test, which has its implementation in StatsModels library. Here’s one example of how it can be used, taken from here.

These are results of a DF test applied to our data.

DF stationarity test 1

The test statistics is higher than the critical value, which leads to a conclusion that we cannot accept the null hypothesis of data stationarity.

The most popular technique used to make data stationary and eliminate trend and seasonality is differentiating.

ARIMA function in PyFlux library has a parameter that is used to set number of differencing order. In most cases, first order of differencing will do the job, but sometimes it is needed to differentiate data several times.

There are enough guidelines on how to determine the order of differencing and there is an excellent post that summarizes rules for identifying a proper ARIMA model, and set its parameters (https://people.duke.edu/~rnau/arimrule.htm ).

So, our data now look like this.

Differenced data

After re-running a stationarity test on differentiated data, we can see the following results.

DF stationarity test 2

A test statistics is now less than the critical value of 5% confidence interval, and we can adopt the null hypothesis that data is stationary.

Looks decent, so we could continue with analysis and run a model.

Set up parameters

The trickiest part of running an ARIMA model is determining the optimal number for AR and MA terms. There are two additional functions that can be used to solve this problem: ACF (autocorrelation function) and PACF (partial autocorrelation function). Their implementation can also be found in StatsModels library. Basically, both are used to determine how past and future data points are related in a time series, where the ACF function measures the correlation between two successive points, and the PACF tries to remove the transitive dependency between points (if a depends on b, and b depends on c, than a depends on c.)

Autocorrelation function can be used to determine the optimal number for MA terms by the following rule: the lag value where the ACF chart crosses the upper confidence interval for the first time is the recommended value for MA parameter. In this case, it is 29.

Autocorrelation function

Partial autocorrelation function is used to determine the optimal value for AR parameter in ARIMA function (the same rule as previous is used).

Partial autocorrekation function

The optimal value for AR parameter is 2.

Since we have optimal parameter values, we are able to run a model. One additional parameter that could be set is a distribution family of the time series – by default it is Normal but other options include Student-t, Laplace, Cauchy, etc.

Running a model


model = ARIMA(data=working_data, ar=2, ma=29, integ=1, target=’output’, family=pf.Normal())


The model is fitted with MLE – maximum likelihood point mass estimate method, which is used to estimate model parameters, by finding particular parametric values that make the observed results the most probable given the model.

Here’s the summary.

ARIMA model summary

AIC and BIC are used as model performance measures. AIC is acronym for Akaike Information Criteria and it is widely used as a measure of a statistical model. It basically quantifies the goodness of fit, and the simplicity of the model into a single statistics. BIC is a Bayesian information criteria. If a model is estimated on a training set, BIC score gives an estimate of the model performance on a testing set. The main rule is “the lower, the better”, and those two can be mixed to choose the best-fit model among several others.

And this is what results look like graphically.

ARIMA forecast results 1

Pretty powerful, right?

Well, at least we thought so, until we came up to something like this.

ARIMA forecast results 2

We were looking for a more stable, accurate model and weren’t satisfied with ARIMA and its 3.56 RMSE, so we dug deeper and found out that the output we’re trying to predict depends on many other features that should also be included in modeling. That’s when we applied a VAR model that allows us to introduce exogenous variables and train an autoregressive model which takes into account values of these external features beside the historical values of the output that should be predicted. We will be talking about it more in the next post.

Comments 9

  1. Nikola Perovic

    How is the DF stationarity test setup in this Python library since (in terms of Ho/Ha) its stated that once p value is less than 5% threshold we cannot reject the null hypothesis of covariance stationarity?(e.g. this would mean that the null is Ho: phi <1) Usually these hypotheses are setup in the sense that low p values means you have enough evidence to reject the null, under some given threshold…Thanks!

    1. Post
      Author
      Valentina Djordjevic

      That’s correct, but the null hypothesis of this test says that there is a unit root present in a time series sample. The alternative hypothesis states that there is no unit root, which means data is stationary. The criteria says: if the p_value is greater than a critical value, then we cannot reject that there is a unit root.
      I defined it the other way, for easier understanding – Ho: data is stationary, H1: data is not stationary, and thus changed the criteria to if the p_value is smaller, we cannot reject the null hypothesis. My bad, should have followed the rules. Excellent observation, many thanks for that, I will pay more attention to that in the future! 🙂

  2. Nemanja

    Hi, Valentina,

    great article! Your explanation of the MLE method is the most concise and intuitive one I’ve read so far.

    Could you please just clarify the cause of the difference between predictions in the last two plots? Is this a train/test performance, performance on different entities, or something third?

    Thanks in advance! 🙂

    1. Post
      Author
      Valentina Djordjevic

      Hi, Nemanja,

      Thank you very much! 🙂

      Both plots are predictions on a test set. The second plot shows the same model ran on data from another object. Seems like objects are very different in between, that’s why we had to involve more features to gain more information on object behavior and finally obtain more accurate results. Hopefully I explained it well. 🙂

  3. James

    I have a question about your example. Although not explicitly shown, I assume that the you feed the original (un-differenced) data series into your ARIMA model. How do you convert the results from the ARIMA model back to the original scale of the input data.

    In my example (when integ=1) my predicted results are on the same scale as the differenced data not the original (un-differenced) data.

    Thanks in advance!

    1. Post
      Author
      Valentina Djordjevic

      That’s correct. It was bothering me, too, so I googled it, and found a way to solve it. I returned it to original scale by taking the first original value of the time series, appending first the rest of the differenced original series, and then predicted values below it, and finally by performing cumsum(). You can find the example on Vidhya blog post, that I referenced before using Dickey-Fuller test. 🙂

      1. James

        Great, your recommendation worked. I believe that integration also works but this is a cleaner process.

        Following your recommendation, I converted my results from the ARIMA(p,1,q) model (which were on the scale of the differenced data) to pandas series, performed pandas.cumsum(), then added a scalar value (i.e. the last value of my training data) to the pandas series. This process converted my data back to original scale.

        Keep in mind that I am differencing once (d=1) in my ARIMA model.

        It’s also worth mentioning that the ARIMA model provided by Statsmodel library will automatically convert the results back to the original scale. The ARIMA model in Pyflux does not.

        1. Post
          Author
          Valentina Djordjevic

          That’s the right procedure, good job. 🙂

          First order of differencing will in most cases solve the problem. 🙂

          You’re right and as far as I know, PyFlux does not have the support of reverse scaling, that may be something to work on in the future.

          P.S. I would like to hear your comparison of these implementations, since I haven’t tried Statsmodel solution yet. You could write me if you’d like to exchange thoughts on valentina@thingsolver.com. 🙂

        2. Terry

          Hi James,

          Would be great if you could share your python code. Facing the same issue – predicted values are scaled to differenced data.

Leave a Reply

Your email address will not be published. Required fields are marked *