Friday talks: The dark horse of Isolation Forest

When dealing with anomalies in data, there are lots of challenges to solve. We’ve already had a couple of articles on this topic, and if you haven’t read them yet, I suggest you take a look at this one, aaand this one, from my fellow colleague Miloš. In this post, I’m going to talk about one of my biggest favourites among anomaly detection algorithms. It is simple, superfast and efficient, with low memory requirements and linear time complexity. I’m not going to talk about usage in detail, since I stumbled upon a great article on Towards Data Science (check it out: Outlier Detection with Isolation Forest). The idea is to share my impressions, inspire you to try it out, and hear your feedback about it.

Forest, you say?

As you can guess from the name, Isolation Forest is an ensemble-based method. And if you’re familiar with how Random Forest works (I know you are, we all love it!), there is no doubt that you’ll quickly master the Isolation Forest algorithm. So, how does it work?

One thing that should be clarified before the algorithm explanation is the concept of “instance isolation”. While most other algorithms try to model normal behaviour, to learn the profile patterns, this algorithm tries to separate the anomalous instances from the rest of the data (thus the term “isolation”). The easier it is to isolate an instance, the higher the chance that it is anomalous. As the authors of this paper point out, most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. What is the problem with that? Well, defining normal behaviour – the boundaries of normal behaviour. In most cases, it is too expensive and time-consuming to label the data and obtain information about normal and anomalous behaviour. And that is where creativity steps onto the stage, producing something simple, yet borderline ingenious (okay, I’m a little bit biased here :D).

So, basically, Isolation Forest (iForest) works by building an ensemble of trees, called isolation trees (iTrees), for a given dataset. An iTree is built by recursive partitioning: at each node, a feature is randomly selected, and then a split value is randomly chosen from the range of values that feature takes. Once the split value is chosen, the partitioning starts – each instance with a feature value lower than the split value is routed to one side of the tree, while each instance with a feature value higher than or equal to the split value is routed to the opposite side. The same is then done recursively on each side, until every instance is placed into a terminal node (leaf), or one of the criteria set in the input parameters is met. The simplified process of building a tree is depicted below.
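To make the partitioning concrete, here is a toy sketch of my own (not the reference implementation), isolating values of a single feature with purely random splits:

```python
import random

def build_itree(values, depth=0, max_depth=10):
    """Recursively partition a 1-D list of values with random split points.

    Toy sketch over a single feature -- the real algorithm also picks
    a random feature at every node. Returns a nested dict.
    """
    if len(values) <= 1 or depth >= max_depth or min(values) == max(values):
        return {"leaf": True, "size": len(values)}
    split = random.uniform(min(values), max(values))
    return {
        "leaf": False,
        "split": split,
        "left": build_itree([v for v in values if v < split], depth + 1, max_depth),
        "right": build_itree([v for v in values if v >= split], depth + 1, max_depth),
    }

def path_length(tree, value, depth=0):
    """Number of edges from the root to the leaf holding `value`."""
    if tree["leaf"]:
        return depth
    branch = "left" if value < tree["split"] else "right"
    return path_length(tree[branch], value, depth + 1)
```

Averaged over many such trees, a value far away from the dense area tends to end up in a leaf after very few splits, while a value inside the dense area needs many.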

The idea is simple – if an instance is easily isolated (meaning fewer steps were needed to place it into a terminal node), it has a higher chance of being an anomaly. From the image that follows, it can be noticed that the instance Xi, from a dense area, required many steps to be isolated, while the lonely instance Xo required far fewer.

So, in the iForest algorithm, the anomaly score is determined by the path length from the root node to the leaf in which the instance is placed. Since it is an ensemble, the average of the path lengths over all trees is taken for a given instance. It follows that the average path length and the anomaly score are inversely related – the shorter the path, the higher the anomaly score. In case I was totally confusing, I will just leave the official documentation here.
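For reference, the paper defines the score as s(x, n) = 2^(−E(h(x))/c(n)), where E(h(x)) is the average path length of instance x and c(n) is a normalising constant – the average path length of an unsuccessful search in a binary search tree built over n instances. A small sketch of that formula:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful BST search over n instances,
    used to normalise path lengths (c(n) = 2H(n-1) - 2(n-1)/n)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(i) ~ ln(i) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)).

    Close to 1 => very short path => likely anomaly;
    around 0.5 => path close to the average => ordinary instance."""
    return 2.0 ** (-avg_path_length / c(n))
```

Note how an average path length equal to c(n) yields exactly 0.5, while a path of length 1 pushes the score towards 1.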

There are two main parameters – the number of trees, and the sub-sampling size. The sub-sampling size controls the size of the sample that will be used for training each tree, when the partitioning is done. The authors report that they have empirically found good values for these parameters to be 100 and 256, for the number of trees and the sub-sampling size, respectively.

A hidden hero…

Now, there are two main problems that can be encountered when dealing with anomaly detection: swamping and masking. Swamping is the situation of wrongly identifying normal instances as anomalous, which can happen when normal and anomalous instances are close to each other. Masking is the situation of wrongly identifying anomalous instances as normal, which frequently happens when they are collectively situated in a dense area, so that they “conceal” their own presence. The sub-sampling in iForest allows it to build a partial model that is resistant to both effects. The next image shows how sub-sampling can ease both problems: it clears out the normal instances surrounding the anomaly clusters, and reduces the size of the anomalous clusters, so that fewer steps are required for their isolation.

The scikit-learn implementation has some additional parameters, like the maximum number of features to be considered, and contamination – the expected proportion of anomalies in the data, which is quite nice: the given percentage of instances with the highest anomaly scores is flagged as anomalous. It has a decision_function() which computes the anomaly score for each instance, and the usual functions for fitting and predicting. And we haven’t gotten to the biggest forte of the algorithm yet! It needs no labels at all – you can feed it only the input data (and keep any labels you have for evaluating the results), it will crunch it and return the output, which makes it a truly hidden hero of the machine learning realms.
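A minimal usage sketch with scikit-learn (the data here is made up purely for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 300 normal points around the origin, plus a few obvious outliers
X_normal = 0.3 * rng.randn(300, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

# contamination is the expected share of anomalies in the data;
# 100 trees and 256 sub-samples are the values suggested by the paper
clf = IsolationForest(n_estimators=100, max_samples=256,
                      contamination=0.05, random_state=42)
clf.fit(X)                          # unsupervised: no labels needed

labels = clf.predict(X)             # +1 = normal, -1 = anomalous
scores = clf.decision_function(X)   # the lower, the more anomalous
```

The instances whose scores fall in the lowest contamination fraction are the ones predicted as -1.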

Introducing some fairness into this post…

In order to reduce bias, let’s talk about the drawbacks, since there are some. The first, and the most important one – it is not able to work with multivariate time series, which is one of the biggest problems we face in practice. Regarding the scikit-learn implementation, one thing that really bothers me is that it is not straightforward to get the decision of each particular iTree (the fitted trees are exposed through the estimators_ attribute, but per-tree decisions have to be derived manually). It is also hard to visualise the trees, and let’s be honest – people looove to see the picture. There is nothing they like more than a perfectly depicted algorithm they can interpret all by themselves. It could be really useful to see which feature caused the anomaly, and with what intensity. Another thing that concerns me – how will it behave with features that are only slightly deviant? I’m afraid that this can lead to similar path lengths, and to the problem of masking (correct me if I’m wrong). Finally, I haven’t really tested it with categorical data, but if someone has experience with it, please drop me a line.

Given its nature, Isolation Forest shows enviable performance when working with high-dimensional or redundant data. It is pretty powerful when fitted in an appropriate way, for problems that don’t require contextual anomaly detection (like time series, or spatial analysis). And as it is superfast, I really like to use it. I’ve tried it on retail and on telco data, and the results were pretty satisfying.

Anomaly detection is still one of the biggest problems analysts encounter when working with data. There is no doubt that more algorithms for detecting anomalies will appear, with lots of improvements and capabilities. But the most important thing is not to neglect the significance and impact of anomalies on results and decision making, because they represent a very sensitive problem and should be carefully approached and analysed. Keep that in mind. Cheers!

Anomaly detection

The problem of anomaly detection is a very challenging one, often faced in data analysis. Whether it is about clustering, classification or some other machine learning problem, it is of great importance to identify anomalies and handle them in some way, in order to achieve optimal model performance. Furthermore, anomalies can influence the analysis results, leading to wrong conclusions and affecting important business decisions. Thus, in every data analysis, it is necessary to accurately define anomalous behaviour in the given domain, apply an appropriate anomaly detection model, separate anomalies from the rest of the data, and then continue with the analysis, model application and conclusion drawing.

Although anomalies are often considered to be irregularities that insert noise into the data, they can carry more information than is commonly believed. The term “anomaly” is often (and I would say wrongly) used as a synonym for “outlier”. That is not a huge mistake, since the whole world uses them as synonyms, but I like to differentiate them in the following way: an “outlier” is an instance that differs significantly from the rest of the instances by its values, or happens randomly and rarely compared to the rest, and thus could be considered irrelevant for the analysis; an “anomaly” is behaviour that differs from what is expected, relative to some previously recorded behaviour, and requires a deeper dive and root cause analysis. For example, in a telecommunication network we can have a cell that is overloaded – in this case, it is considered an outlier among the other cells in the network. But if, in the same network, several different cells are overloaded for some period of time, as a group they can represent an anomaly that occurred, for example, due to a failure or congestion on the link that connects them.

To be carried out in the right way, the anomaly detection process should include the following steps:

  1. Understanding the research domain
    • learning the basic concepts of the matter being analysed
    • consultations with the domain expert
    • determining and defining the term “anomaly” in a given domain
  2. Understanding the data
    • descriptive analysis – describing the data and getting basic insights into behaviour
    • exploratory analysis – detecting the hidden relationships and dependencies among features and getting more detailed insights into data
  3. Determining the set of techniques that could be used for the process
    • supervised learning techniques
    • unsupervised learning techniques
    • semi-supervised learning techniques
  4. Choosing the model
  5. Applying the model
  6. Evaluating the model
  7. Interpreting the detected anomalies
  8. Drawing conclusions

Defining the term “Anomaly”

An anomaly represents behaviour in the data that differs from some expected behaviour. An outlier can be interpreted as an instance that deviates from the rest of the instances with no deeper meaning – its behaviour can easily be explained, and it can therefore be ignored and removed. Anomalies, on the other hand, represent grouped or correlated outliers: deviations with a deeper cause, different from a simple human mistake or a wrong reading, irregularities that are not so easily detected and explained, since they are usually hidden among normal instances.

Types of anomalies

Anomalies can be grouped into the following three classes:

  1. Point anomalies
  2. Contextual anomalies
  3. Collective anomalies

A point anomaly, as the term says, is an instance that can be considered anomalous on its own, among the other instances in the dataset. Point anomalies often represent some extreme value, irregularity or deviation that happens randomly and has no particular meaning. I like to refer to it as an outlier. On the time series graph below, red points represent isolated point anomalies.

point anomalies

A contextual anomaly is an instance that can be considered anomalous in some specific context. This means that observing the same point in different contexts will not always indicate anomalous behaviour. A contextual anomaly is determined by combining contextual and behavioural features. Time and space are the most frequently used contextual features, while the behavioural features depend on the domain being analysed – the amount of money spent, the average temperature, or some other quantitative measure.

If we add a contextual feature, like the time dimension, the same time series will look like the following. Similar values are flagged differently for different time periods. If the time series below shows sales for each month, the seasonal peaks in the holiday period (December) represent growth due to a higher number of purchases, while the peak in July 2017 represents unexpected, anomalous growth that requires deeper analysis to explain – the cause could be a sales promotion, a music festival, or a sports event. Some anomaly detection libraries can receive date ranges that are expected to have extreme values for the feature being analysed.

contextual anomaly
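As a toy illustration of the idea (not how production libraries model seasonality), one could flag contextual anomalies by comparing each value only against its peers from the same seasonal position – the same calendar month across years:

```python
from statistics import mean, stdev

def contextual_flags(series, period=12, threshold=3.0):
    """Flag values deviating strongly from the mean of their own seasonal
    position (e.g. the same month across years). Toy sketch only -- real
    libraries model trend and seasonality far more carefully."""
    flags = []
    for i, value in enumerate(series):
        # peers = values occupying the same position in other periods
        peers = [v for j, v in enumerate(series)
                 if j % period == i % period and j != i]
        if len(peers) < 2 or stdev(peers) == 0:
            flags.append(False)
            continue
        z = abs(value - mean(peers)) / stdev(peers)
        flags.append(z > threshold)
    return flags
```

With this scheme, a December peak that recurs every year stays unflagged, while a July value far above other Julys gets flagged even if its absolute height matches the December peaks.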

A collective anomaly is often represented as a group of correlated, interconnected or sequential instances. While each particular instance of this group doesn’t have to be anomalous itself, their collective occurrence is anomalous.

The time series below is pretty similar to the previous one, except that the growth lasts for the whole month of July 2017. Since it is a one-time event, there is no doubt that it is anomalous behaviour.

collective anomalies

It is very important to stress that contextual and collective anomalies do not require special handling so much as deeper analysis and root cause identification. Sometimes a valuable root cause explanation is essential in order to handle and smooth these kinds of anomalies in a proper way.

Anomaly detection techniques

Every anomaly detection technique belongs to one of the following basic approaches:

  1. supervised anomaly detection – includes modelling both normal and anomalous behaviour. It is analogous to the supervised approach to classification, and it requires labelled data. The model learns on the training data, trying to capture the patterns of both types of behaviour based on the available features. The goal is a model that can classify each new instance as normal or anomalous, based on the patterns captured in the training phase and the instance attributes given as input. This approach includes classification-based techniques.
  2. unsupervised anomaly detection – searching for anomalies with no previous knowledge of the data. It is analogous to the unsupervised approach used for clustering, where similar instances are grouped based on some similarity measure – be it distance, density or the position of the assigned node in a binary tree. Under the assumption that anomalies are well separated from the rest of the data, the goal is a model that groups instances into clusters of normal instances, with the anomalies possibly forming one or more clusters of their own. This approach includes techniques based on clustering and the nearest-neighbour concept.
  3. semi-supervised anomaly detection – a mixture of the previous two. This approach models just one type of behaviour, the most frequent one – normal behaviour. It is considered semi-supervised since the model learns over instances belonging to only one class. The advantage is that the model can be trained incrementally, as new instances appear. The goal is a model that has properly learned the patterns of normal behaviour and can define a criterion, or some kind of limit, used to determine whether an instance’s behaviour corresponds to the learned normal behaviour or not. This approach mostly includes techniques based on statistical methods or novelty detection.
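As a small sketch of the semi-supervised flavour, scikit-learn’s OneClassSVM can be fitted on normal instances only and then asked whether new instances conform to the learned behaviour (the data below is made up for illustration):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# train on normal behaviour only -- the semi-supervised setting
X_train = 0.5 * rng.randn(200, 2)
clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_train)

# ask whether new instances conform to the learned normal profile
X_new = np.array([[0.1, -0.2],   # close to the training cloud
                  [4.0, 4.0]])   # far outside it
pred = clf.predict(X_new)        # +1 = conforms, -1 = does not
```

The nu parameter plays a role similar to contamination: it upper-bounds the fraction of training instances allowed to fall outside the learned boundary.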

The selection of the right set of techniques depends on the data available. With labelled data, the decision is pretty simple – a lot can be done when information about normal and anomalous behaviour is given. With partially labelled or unlabelled data, the task is more complex, and something of an art. A detailed description of the best-known anomaly detection techniques can be found here, along with the assumptions each of them requires in order to be used, their pros and cons, and their computational complexity.

In my next post, I will be talking about Isolation Forest algorithm, and how it can be used to detect anomalies in the data.