Hey there! Welcome to my blog post. I hope you are doing great!
Feel free to contact me for any consultancy opportunity in the context of big data, forecasting, and prediction model development (idrisstsafack2@gmail.com).
In my last post, titled "ARMA models with R: the ultimate practical guide with Bitcoin data", I discussed how to estimate and forecast a time series with an ARMA model. Something I mentioned but did not discuss in detail is that one of the most important conditions of such statistical approaches is that the original time series used for modelling should be stationary.
In this new blog post we will discuss how to test the stationarity of a time series with the R software. At the end of this post, you will be able to detect whether a time series is stationary using either the visual/graphical approach or the inference/statistical testing approach.
Indeed, as I already said, one of the most important requirements of the ARMA model, or of any other classical time series modelling method, is that the original time series data should be stationary. The problem is that in most cases this condition does not hold: the majority of time series encountered in practice are not stationary. This prior step of time series analysis is critical because using nonstationary time series usually leads to spurious regressions, meaning that the estimated parameters of the considered model are not consistent.
In the context of machine learning for time series forecasting, this requirement can sometimes be relaxed. Let us now define the stationarity of a time series.
What is stationarity of a time series?
We say that a time series is stationary if the joint cumulative distribution function of the series does not change over time. This is what we call strong stationarity. In practice, this definition is hard to work with.
A simpler and more practical definition of stationarity is what we call weak stationarity. Indeed, a time series is said to be weakly stationary if
- its mean is constant
- its standard deviation is constant
- its autocovariance does not depend on time but only on the lag between the two observations considered.
Example of a stationary time series
A simple and common example of a weakly stationary time series is Gaussian white noise. Here is a simple R simulation of 100 observations of white noise drawn from a normal distribution with mean 0 and standard deviation equal to 2.
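Since the original snippet is not reproduced here, below is a minimal sketch of that simulation in base R (the seed and the variable name Y are my own choices):

set.seed(123)                               # assumed seed, only for reproducibility
Y <- rnorm(100, mean = 0, sd = 2)           # 100 draws of Gaussian white noise
plot(Y, type = "l", xlab = "Time", ylab = "Y",
     main = "Simulated white noise")
abline(h = 0, col = "red")                  # the constant mean of the series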
Here is a plot of the related time series.
As we can see, the time series Y oscillates around the value zero, meaning that zero is the average value of this time series and that this average is approximately constant. Also, the amplitude of the oscillations around the average value is almost the same over time.
These properties are less restrictive than the definition of strong stationarity. When a time series does not satisfy at least one of these properties, it is naturally called a nonstationary time series. Indeed, a nonstationary time series typically displays seasonality, trend or cycle effects.
P.S. : Feel free to contact me for any consultancy opportunity in the context of big data, forecasting, and prediction model development (idrisstsafack2@gmail.com).
Example of a nonstationary time series
Here is an example of a nonstationary time series. Indeed, I simulate 100 observations from a standard normal distribution, called Y. From that series, I build a time-varying mean and standard deviation by multiplying the noise by the time index and adding the time index to the mean value. I call the result of this combination the variable Z. Here is the code.
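A minimal sketch of that simulation, assuming the time index is simply 1, 2, ..., 100 (the exact scaling used in the original code may differ):

set.seed(123)                                # assumed seed
t <- 1:100                                   # time index
Y <- rnorm(100, mean = 0, sd = 1)            # the underlying white noise
Z <- t + t * Y                               # time-varying mean (+ t) and standard deviation (sd * t)
plot(Z, type = "l", xlab = "Time", ylab = "Z",
     main = "Simulated nonstationary series")
lines(t, col = "red", lwd = 2)               # the time-varying mean, shown as the red line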
Then we obtain the following graph
From this figure, we can see that the mean value of the time series Z, represented by the red line, is increasing with time. Basically, the time series tends to oscillate around that red line. Similarly, the dispersion of the oscillations around the mean value (representing the standard deviation) is increasing with time, meaning that the standard deviation is time-varying. As the mean and the standard deviation are both time-varying, we conclude that the time series Z is nonstationary. This is one way to directly check visually whether a time series is stationary.
Now we can discuss how to check the stationarity of a time series. Let us first discuss the graphical approach.
Detecting if a time series is stationary with the graphical approach
In the previous section with the examples, we already used the common approach to visually detect whether a time series is stationary. To summarize, the visual or graphical approach consists in plotting the time series and observing whether it displays a trend, seasonality or a cycle effect. If the series displays one of these components, we conclude that it is nonstationary. For the practical example, we will use the Bitcoin historical prices at the daily frequency. Here is the graph of the Bitcoin historical prices from Yahoo Finance from 2015 to 2020.
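For reproducibility, here is one possible way to retrieve and plot these prices in R with the quantmod package (the BTC-USD ticker and the exact date range are assumptions on my part; the original data may have been exported differently from Yahoo Finance):

library(quantmod)                            # install.packages("quantmod") if needed

# download daily BTC-USD prices from Yahoo Finance (assumed date range)
btc <- getSymbols("BTC-USD", src = "yahoo",
                  from = "2015-01-01", to = "2020-12-31",
                  auto.assign = FALSE)
btc_close <- Cl(btc)                         # keep the daily closing price
plot(btc_close, main = "Bitcoin daily closing price")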
From this figure, we can observe that the time series displays an increasing trend over two different periods, 2015-2018 and 2019-2020, and a decreasing trend over the period 2018-2019. The presence of such trends means that the Bitcoin time series is a nonstationary time series.
Here is the plot of the time series with its related trend
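If you want to reproduce a similar trend line, a centred moving average can serve as a rough proxy for the trend (the 90-day window is an arbitrary choice of mine, and the original figure may have been built differently):

price <- as.numeric(na.omit(btc_close))                   # plain numeric vector of closing prices
trend <- stats::filter(price, rep(1 / 90, 90), sides = 2) # 90-day centred moving average
plot(price, type = "l", xlab = "Time (days)", ylab = "BTC-USD close")
lines(as.numeric(trend), col = "red", lwd = 2)            # the estimated trend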
To complement this visual approach, one can also calculate the mean and the standard deviation over these three segments and check whether they are the same across segments. We split the data into 3 segments:
segment1 = 2015 to 2018
segment2 = 2018 to 2019
segment3 = 2019 to 2020
Note that before the calculations I needed to load the "tidyverse" and "dplyr" libraries; the drop_na function (from the tidyr package, included in the tidyverse) is used to drop the NA values after splitting the original time series.
Then here is the code for the calculations
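A sketch of those calculations could look like the following (the data frame name btc_df, the exact segment boundaries on January 1st, and the use of the btc_close object from the earlier download sketch are my assumptions):

library(tidyverse)                           # loads dplyr and tidyr (drop_na)

# build a data frame from the closing prices and drop missing values
btc_df <- tibble(date  = index(btc_close),
                 price = as.numeric(btc_close)) %>%
  drop_na()

segment1 <- btc_df %>% filter(date >= as.Date("2015-01-01"), date < as.Date("2018-01-01"))
segment2 <- btc_df %>% filter(date >= as.Date("2018-01-01"), date < as.Date("2019-01-01"))
segment3 <- btc_df %>% filter(date >= as.Date("2019-01-01"), date <= as.Date("2020-12-31"))

# mean and standard deviation of each segment
sapply(list(segment1 = segment1, segment2 = segment2, segment3 = segment3),
       function(s) c(mean = mean(s$price), sd = sd(s$price)))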
We obtain the following results:
We can see that the mean and the standard deviation of the 3 segments are clearly different from one segment to another. We therefore conclude that the daily Bitcoin price series is nonstationary: it displays both a time-varying mean and a time-varying standard deviation.
Now let's talk about the statistical testing approach
Detecting the stationarity of a time series with the statistical test
So far we have discussed the graphical approach to check whether a time series is stationary. The problem is that the graphical approach introduces some subjectivity in the decision rule. It would therefore be great to have a more objective approach, and that is why the statistical testing approach was developed. In this section, we will present how to use a statistical test to check the stationarity of a time series.
The most popular approach to test the stationarity of a time series is the unit-root test, also known as the Dickey-Fuller test. Indeed, the intuition behind the unit-root test is to assess how significant the trend observed in a time series is. There are several variants of the unit-root test, and the appropriate one depends on the graphical observations we make on the series: if we observe a time-varying mean and/or a time-varying standard deviation, the configuration of the Dickey-Fuller test changes.
Another popular version of the Dickey-Fuller test is the Augmented Dickey-Fuller (ADF) test. This version additionally includes lags of the differenced time series in the regression. Indeed, this version relies on 3 different types of linear regressions:
- Type 1: Linear regression without a drift
The equation to be estimated is the following
dY[t] = a1Y[t-1] + b1dY[t-1] + ... + b[nlag]dY[t-nlag+1] + e[t]
where dY[t] = Y[t] - Y[t-1], nlag is the number of lags to consider for the differenced values of the time series Y[t], and e[t] is the error term.
- Type 2 : Linear regression with a drift
We use the same equation as in type 1 and add the drift term, called a0
dY[t] = a0 + a1Y[t-1] + b1dY[t-1] + ... + b[nlag]dY[t-nlag+1] + e[t]
- Type 3 : Linear regression with a drift and a deterministic trend
Now we add a deterministic trend term to the equation presented in type 2
dY[t] = a0 + b0*t + a1Y[t-1] + b1dY[t-1] + ... + b[nlag]dY[t-nlag+1] + e[t]
The main purpose is to test the significance of the coefficient a1 (the coefficient on Y[t-1]) in the linear regression model.
It is important to note that the simple Dickey-Fuller test is the special case of the ADF regression with no lagged difference terms (that is, without the b coefficients).
Here are the hypotheses of the statistical test:
Null hypothesis (H0): The time series has a unit root. This means that the time series is nonstationary (a1 = 0).
Alternative hypothesis (H1): There is no unit root in the time series, meaning that the time series is stationary (a1 < 0).
The statistic of the test
The statistic of this test is nothing else than the ratio of the estimated value of a1 (from the linear regression) to its standard error:
T_STAT = a1_hat / SE(a1_hat)
where a1_hat is the estimated value of a1 obtained from one of the three types of regressions and SE(a1_hat) is its standard error.
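To make the link with the regression explicit, here is a small, purely illustrative R sketch of the type 2 regression (with a drift and one lagged difference) estimated by ordinary least squares. Keep in mind that the resulting ratio must be compared with Dickey-Fuller critical values, not with the usual Student table, which is why in practice we rely on a dedicated test function rather than on lm:

# Illustrative only: type 2 ADF regression (drift) with one lagged difference
adf_tstat <- function(Y) {
  N      <- length(Y)
  dY     <- diff(Y)                      # dY[i] corresponds to Y[i+1] - Y[i]
  y_dep  <- dY[2:(N - 1)]                # dY[t]   for t = 3, ..., N
  y_lag  <- Y[2:(N - 1)]                 # Y[t-1]  for t = 3, ..., N
  dy_lag <- dY[1:(N - 2)]                # dY[t-1] for t = 3, ..., N
  fit    <- lm(y_dep ~ y_lag + dy_lag)   # the intercept plays the role of the drift a0
  coefs  <- summary(fit)$coefficients["y_lag", ]
  unname(coefs["Estimate"] / coefs["Std. Error"])   # T_STAT = a1_hat / SE(a1_hat)
}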
The decision rule of this statistical test is based on the test statistic or on the p-value of the test, compared with a chosen significance threshold. We usually consider 3 different thresholds, namely 1%, 5% and 10%. The most commonly used one in economics and finance is 5%.
Therefore, if the p-value is lower than 5%, we reject the null hypothesis and conclude that the time series is stationary. Otherwise (p-value > 0.05), we fail to reject the null hypothesis and conclude that the time series is nonstationary. If the decision is based on the test statistic instead, one should read the critical value in the special Dickey-Fuller table (the usual Student table does not apply):
If T_STAT < critical value read in the Dickey-Fuller table, then we reject the null hypothesis.
To run the ADF test in R, we can use the adf.test function from the "tseries" package:
adf.test(x, alternative = "stationary", k = lag_order)
where x is the time series to be tested, alternative specifies the alternative hypothesis of the test (stationarity here), and k is the number of lags to consider; when k is not provided, the function uses trunc((length(x) - 1)^(1/3)) by default. In order to run the test, we need to load the "tseries" package. If it is not installed, you just have to install it with install.packages("tseries") and then load it for usage. Here is the code with our Bitcoin data to test the stationarity with the ADF test.
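A minimal sketch, assuming the daily closing prices are stored in the btc_close object created in the earlier download sketch:

library(tseries)                              # install.packages("tseries") if needed

btc_vec <- as.numeric(na.omit(btc_close))     # adf.test does not accept missing values
adf.test(btc_vec, alternative = "stationary") # lag order defaults to trunc((length(x) - 1)^(1/3))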
Hence, we obtain the following result
The number of lags used for the test is 12. We obtain a test statistic of -1.559 and a p-value of 0.765. Since the p-value is greater than 0.05, we conclude that there is not enough evidence to reject the null hypothesis, meaning that the Bitcoin time series is nonstationary.
To conclude, we can use two different techniques to check whether a time series is stationary: the graphical approach and the statistical testing approach. Both methods are useful and should be combined in any empirical analysis of a time series.
That's all for this post. I hope it will be helpful. If you liked it, please share it with your friends and your community of machine learning engineers and data scientists. See you in the next post.
Feel free to contact me for any consultancy opportunity in the context of big data, forecasting, and prediction model development (idrisstsafack2@gmail.com).