Nowadays, analyzing IoT data and extracting business value from it is becoming a hot topic. IoT data domain applications are diverse, to give few examples, we can list: Industry 4.0, smart city or connected health. Analyzing IoT data enables anomaly detection, predictive maintenance, prediction of stocks, air quality, weather or even illness. The list of usages is still long. If you want to have a global view of these usages and the relationship between data science and IoT, we invite you to read our blog article: Data Science & Internet of Things.
Among all types of IoT data, time series data (e.g., data from sensors indexed in time order) are becoming more and more widespread. They need to be stored and analysed. In the following, we introduce a time series forecasting technique using ARIMA modelling which is considered as the most general class of models for forecasting linear time series.
Please note that the purpose of this article is not to dive into the theory of time series, but to give you a quick understanding of how to analyse this type of data using Python. We will practice time series modelling using a dataset from kaggle about climate change and the variation of earth surface temperature (a topical issue!).
But first, what are time series?
Time series data are time-dependent unlike cross sectional data for instance. When performing a linear regression, one assumption is that there is no autocorrelation within independent variables. This assumption is not respected with time series data where a next value is affected by the previous one. Hence, since each variable from a time series sequence is correlated with the next one, by recursion, all the variables are correlated. This is called autocorrelation or serial correlation: the correlation of a signal with a lagged version of itself.
When analysing time series data, we usually look for these main components: trend, seasonal or cyclic patterns and fluctuations.
Time series data are mostly used to predict / forecast the future instances of a measured variable based on its past observations.
And how do we proceed to analyse this type of data?
In fact, the overall approach to analyse time series data is not that different from the one used for other data types. Indeed, we always have the phases of data cleaning and exploration, model choice and model fitting . In terms of model fitting, we will be using ARIMA models that catch well the dependency in time series data as described in the previous section.
Speaking about tools, it should be noted that pandas has some functionalities dedicated to analyse times series objects which allow to store time data and perform operations on it quickly. In this article, we will also use the statsmodels library and its module tsa which contains functions for time series analysis but also for time series modelling and forcasting.
Well, let’s practice through a simple example! As mentioned in the introduction, we will use a dataset from kaggle . We will forecast the monthly average temperature in France for the next years. From the kaggle dataset, we will focus only on the
'GlobalLandTemperaturesByCountry.csv' file. This file contains the temperature of each country in the world. We will consider only ‘France‘ as a country. The data ranges from 1743 to 2013.
Loading and analysing data using pandas
Let’s start by importing required libraries:
Then, we load the data and display few rows and data types:
Each row consists of 4 columns:
- ‘dt’: the date of recording.
- ‘AverageTemperature’: the average temperature per month expressed in Celsius.
- ‘AverageTemperatureUncertainty’: the 95% confidence interval around the average.
- ‘Country’: the country where the temperature is recorded.
In the following and considering the code snippet below, we choose not to take into account the AverageTemperatureUncertainty information (1), and we filter on France as a country (2).
At this stage, the data frame is still not read as a time series object. With pandas, the index can be the variable depicting date-time information. In our case, we use the column ‘dt’ as index (4). Since the dataset is too large, we use dates starting from 1950 (6).
We test whether we still have some NaN values after filtering on dates (7). There is 1 null value. There are many approaches to replace missing data. We choose here to replace NaN values with the last valid value (9).
Checking data stationarity
Stationarity is an assumption underlying many statistical procedures used in time series analysis and particularly in ARIMA modelling. Time series data are said to be stationary if their statistical properties such as mean and variance remain constant over time and its autocorrelation does not depend on time. Trend and seasonality usually affect the stationarity of a time series…but not always.
To check for data stationarity, we can first simply plot the data and analyse visually whether there is any trend or seasonality. Usually it is hard to see it visually: in that case, we can, for instance, plot the rolling mean and see if it varies over time.
Statsmodels library offers a statistical tool called the Augmented Dickey-Fuller Test which helps finding out stationary properties in a collection of data. The function below plots the rolling mean and then performs the Augmented Dickey-Fuller test.
We call the function with the suitable data:
We can see that the trend is almost invariant over the time. The average temperature is around 13° in the beginning and goes up around 14° starting from 1996.
Moreover, the Augmented Dickey-Fuller test is showing that the data is stationary: the p-value is small (<0.005) and the Test Statistic is lower than the Critical Value (1%), so it can be concluded with 99% confidence that the data is stationary.
If the data is not stationary, we need to make it so. One known way to do that is with differencing. This example shows how to differenciate time series data in order to make it stationary using Python.
Modelling and forecasting
n this part, we will learn how to fit the data using ARIMA modelling and make predictions. But first, let’s learn a bit about ARIMA models without going too deep into the theory.
ARIMA stands for Auto-Regressive Integrated Moving Averages. So, ARIMA is the combinaison of 3 models: Auto regressive, Integrated and Moving averages models that depend respectively on the parameters (p,d,q):
1- p represents the number of Auto-Regressive (AR) terms. In Auto-Regressive models, a variable is a linear function of past lags of itself. p represents the number of lags of the dependent variable to consider. We write an AR(p) model as follows:
Where are the parameters of the model, C is a constant and is a white noise term.
2- q represents the number of Moving Averages (MA) terms. In Moving-Average models, a variable is a linear function of the residuals from previous periods. q represents the number of lags of residuals to consider. We write an MA(q) model as follows:
Where θ1, …, θq are the parameters of the model and the εt, εt−1,…, εt−q are white noise error terms.
3- d represents the order of differencing. Differencing is a way of transforming a non-stationary series to a stationary one. This is done by subtracting the observation in the current period from the previous one. So if we consider y as the dth difference of the original time series y:
If your data is already stationary, you do not need the Integrated (I) part of ARIMA. You just need the ARMA model which is a linear combination of AR(p) and MA(q) terms.
A last question mark: how to find the best values for p and q parameters?
There are many ways to find the right values for p and q parameters. One way is to directly observe the plots of ACF (Autocorrelation function) and PACF (Partial Autocorrelation function). ACF measures the correlation between the time series data yt and a lagged version of itself yt–h. PACF measures the correlation between yt and yt–h after removing linear dependence on y1, y2, …, yt–h+1. This means that the intermediate correlations are eliminated and we keep only the residual correlation between yt and yt–h. ACF and PACF are tools to analyse the dependencies in a time series. So let’s plot ACF and PACF:
There are some rules to follow when analysing ACF and PACF plots in order to identify the numbers of AR and MA terms to consider:
- For an AR model, PACF drops when the order of the model (p) is reached while the ACF decreases exponentially. ACF will show some sinusoidal components for order higher than 1 (p>1). The number of non-zero partial autocorrelation will give the order (p) of AR.
- For an MA model, it is the opposite: the ACF will drop after the order (q) of the MA model is reached while the PACF will decrease exponentially. PACF will show some sinusoidal components for order higher than 1 (q>1).
- For ARIMA models (a mixture of both AR and MA models), we will see a mixture of exponentially decreasing and sinusoidal components (specially for orders higher than 1).
So, back to our plots, both ACF and PACF graphs above, show exponential decrease and some sinusoidal components. We can guess then that we have both AR and MA terms with orders higher than 1. PACF drops considerably at lag p=2 (it starts at zero at the graph but it represents the first lag) and ACF drops considerably at lag q=3, this gives us an estimation of the ARMA order = (2,3). The purple shadow on the two graphs represents the 95% confidence interval.
Usually, using ACF and PACF only isn’t enough to find the best values for p and q parameters that avoid overfitting the model. The best way to tune these parameters is to grid-search them. To evaluate and compare different ARMA models with different parameters p and q, we will use the AIC (Akaike Information Criterion) value, which is returned with ARIMA models in statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. We will write this code to do it.
A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same fit. Thus, we are interested in finding the model that yields the lowest
AIC value. The lowest returned AIC value above, is for (p,q)=(2,3). So we wil be using ARMA(2,3) model in the following to fit the data, we will also measure the MSE between real data and fitted data.
Now, that we defined a model for our time series data and estimated its MSE, we can use it to produce forecasts:
We represent the 10-year historic weather data with the red line and the next 10-year predicted data with the blue line. The grey area represents the 95% confidence interval.
We can see that the forecasted values match pretty closely the observed average temperature and they are all within the confidence interval of the forecast. Besides, the MSE is reasonable compared to the one calculated on fitted data.
We can display the predicted values throughout this code line. Below are some prediction examples. We can see that the temperature is slightly increasing from 2022 to 2023:
This is an introductory tutorial for analysing time series data aiming to provide a standard approach to solve time series problems using ARIMA models. We saw how to use pandas and statsmodels in order to manipulate, model them using ARMA model and forecast the average temperature of France. In IoT applications, ARIMA can be used for modelling temporal data from sensors to perform predictions.