Time series is a term you have almost certainly come across in your Data Science career. If you are completely new to it, don’t worry, it is really intuitive. What actually is a Time Series? It is a collection of data points collected at constant time intervals. In other words, we have historical data about some feature, recorded every day or at some other time interval. The important thing is that the interval is the same every time. In the next picture you can see one example of a Time Series.
There are two main ways to work with Time Series: we can analyse the historical data itself, which is called Time Series analysis, or we can use the historical data to predict future values, which is called Time Series forecasting. Later blogs will cover forecasting in more detail; here we focus on analysis.
Time Series Analysis
Time Series analysis is a popular field of Data Science. It involves developing models (statistical or machine learning) that describe an observed Time Series as well as possible and, ideally, explain its underlying causes and patterns. To do so, we can use several techniques, such as detecting anomalies, building autoregressive models, recognizing trend patterns and measuring the similarity of Time Series. This is useful in many different industries. We found it useful in a few of our own projects, when we wanted to answer questions like:
- Why does the data change this way?
- Are there any patterns in the behaviour of some time series features?
- Which time series show similar patterns?
We did detailed research to find the best way to answer those questions and, in the end, decided to try Time Series clustering.
Time Series clustering
The main goal of Time Series clustering is to partition Time Series data into groups based on similarity or distance, so that Time Series in the same cluster are similar. At first it looks like the same problem as any basic clustering, but here we have specific data and specific decisions to make before fitting the model.
Image source: Comparing Time-Series Clustering Algorithms in R Using the dtwclust Package by Alexis Sarda-Espinosa
The main decisions that we need to make are:
- how to measure the similarity between Time Series
- how to compress the Time Series data or reduce its dimensionality
- which algorithm to use for clustering
There are many ways to define a similarity measure for clustering. The first choice would be Euclidean distance. The problem with Euclidean distance is that it compares the series point by point, so it often underestimates similarity when there is distortion in the time axis, for example when the same pattern appears slightly shifted or stretched. It also cannot compare Time Series that are not the same length. The way to deal with both issues is to use Dynamic Time Warping (DTW) as the similarity measure.
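To make the time-axis problem concrete, here is a minimal sketch in plain Python. The three series are hypothetical toy data, not taken from our projects: two contain the same peak one step apart, and one is completely flat.

```python
import math


def euclidean_distance(a, b):
    """Point-by-point Euclidean distance; both series must have equal length."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs series of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy data: the same peak, one time step apart, plus a flat series.
early_peak = [0, 0, 5, 0, 0, 0]
late_peak = [0, 0, 0, 5, 0, 0]
flat = [0, 0, 0, 0, 0, 0]

shifted_distance = euclidean_distance(early_peak, late_peak)
flat_distance = euclidean_distance(early_peak, flat)

# The shifted copy of the peak ends up *farther* away than a series with no peak.
print(shifted_distance > flat_distance)
```

Under this measure, the series with the identical but shifted peak looks less similar than the one with no peak at all, which is the opposite of what we want when clustering by shape.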
Dynamic Time Warping finds the optimal non-linear alignment between two Time Series. This similarity measure deserves a blog of its own. If you are interested in how it works, you can find an excellent explanation in this video. But why is this useful? If we want to detect the same pattern in our data, we may not care about the exact time at which the pattern happened; we just want to detect it and put the matching series in the same cluster. For example, if we are analysing a retailer’s sales across many stores, we want to detect where there is a sudden peak in sales, cluster those stores together and then analyse further to see what caused it, no matter when the peak happened in time.
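As a rough illustration, below is a minimal sketch of the textbook dynamic-programming formulation of Dynamic Time Warping in plain Python. It assumes absolute difference as the local cost and no warping-window constraint; library implementations may use different defaults. The toy series are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed previous alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


# Hypothetical toy data: the same sales peak at different times and lengths.
early_peak = [0, 0, 5, 0, 0, 0]
late_peak = [0, 0, 0, 5, 0, 0]
short_peak = [0, 5, 0]
flat = [0, 0, 0, 0, 0, 0]

print(dtw_distance(early_peak, late_peak))   # shifted peak
print(dtw_distance(early_peak, short_peak))  # different length
print(dtw_distance(early_peak, flat))        # no peak at all
```

Because the alignment is allowed to stretch the flat parts, the shifted and shorter peaks align perfectly with the original, while the flat series still pays for the missing peak.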
Image source: Dynamic time warping under pointwise shape context by Zheng Zhang, PingTang and Rubing Duan
Algorithms for clustering Time Series
The other important step is to decide which clustering algorithm to use. Here, I have listed those that allow DTW as a similarity measure, along with their characteristics and requirements.
In our case, KMeans turned out to be the best algorithm for Time Series clustering. It was the fastest and the simplest solution. Nevertheless, we encourage you to try out all of them, because the results depend on the data you are working with.
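To show how a DTW-based clustering loop fits together, here is a minimal sketch in plain Python. Note that it is not the KMeans setup described above: as a stand-in it uses k-medoids, a close relative that only needs pairwise distances and therefore avoids averaging series under DTW. The store data, the function names and the farthest-first initialisation are all assumptions made for the illustration.

```python
def dtw_distance(a, b):
    """DTW with absolute difference as the local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = abs(a[i - 1] - b[j - 1]) + step
    return cost[len(a)][len(b)]


def assign(dist, medoids):
    """Label each series with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(series, k, max_iter=20):
    """Cluster series into k groups using pairwise DTW distances."""
    n = len(series)
    dist = [[dtw_distance(s, t) for t in series] for s in series]
    # Deterministic farthest-first initialisation (an arbitrary choice here).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster ends up empty
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids), medoids


# Hypothetical stores: three with a sales peak at different times, three steady.
stores = [
    [0, 0, 5, 0, 0, 0],
    [0, 0, 0, 5, 0, 0],
    [0, 5, 0, 0, 0, 0],
    [2, 2, 2, 2, 2, 2],
    [2, 2, 2, 2, 2, 3],
    [2, 2, 3, 2, 2, 2],
]
labels, medoids = k_medoids(stores, k=2)
print(labels)
```

On this toy data the three stores with a peak end up in one cluster regardless of when the peak happened, and the steady stores end up in the other.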
Of course, it is not enough to just cluster your data. You have another problem – to find the optimal number of clusters. In order to do that, you have to define an evaluation metric which you will optimize. Here are the most widely used evaluation metrics for clustering.
After some research, we concluded that we can rely on the Silhouette score, but we also monitor the Calinski-Harabasz index (because Davies-Bouldin is restricted to Euclidean distance as its distance metric, which we are not using). It is good practice to have two evaluation metrics, to be more confident that the results obtained are optimal.
The most important thing when doing Time Series clustering is to understand the data and the domain the data comes from. Our evaluation metric may give us one number for the optimal number of clusters, but we should make the final decision only after analysing the results and seeing how they can be interpreted. If you are working on this with a domain expert, then you are lucky, because they can help you find the best solution in the most efficient way. 🙂
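The Silhouette score only needs pairwise distances, which is why it works with a non-Euclidean measure such as DTW. Below is a minimal sketch in plain Python that computes it from a precomputed distance matrix. The matrix values and the function name `silhouette_from_distances` are hypothetical, chosen only for the illustration.

```python
def silhouette_from_distances(dist, labels):
    """Mean silhouette over all points, given a full pairwise distance matrix.

    Assumes at least two distinct cluster labels.
    """
    n = len(labels)
    clusters = set(labels)
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Hypothetical pairwise distances (e.g. DTW) between four series.
dist = [
    [0, 1, 8, 9],
    [1, 0, 9, 8],
    [8, 9, 0, 1],
    [9, 8, 1, 0],
]

good = silhouette_from_distances(dist, [0, 0, 1, 1])  # close pairs grouped together
bad = silhouette_from_distances(dist, [0, 1, 0, 1])   # close pairs split apart
print(good > bad)
```

Values close to 1 mean the series sit well inside their clusters, and negative values mean many of them are closer to another cluster. To pick the number of clusters, the same score can be computed for the labels produced by each candidate number of clusters and the results compared.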
Cover photo credits: https://unsplash.com/photos/nN5L5GXKFz8