Friday talks: The dark horse of Isolation Forest

When dealing with anomalies in data, there are lots of challenges to be solved. We’ve already had a couple of articles on this topic, and if you haven’t read them yet, I suggest you take a look at this one, aaand this one, from my fellow colleague Miloš. In this post, I’m going to talk about one of my biggest favourites among anomaly detection algorithms. It is simple, superfast and efficient, with low memory requirements and linear time complexity. I’m not going to talk about usage, since I stumbled upon a great article on Towards Data Science (check it out: Outlier Detection with Isolation Forest). The idea is to share my impressions, inspire you to try it out, and hear your feedback.

Forest, you say?

As you might already assume, Isolation Forest is an ensemble-based method. And if you’re familiar with how Random Forest works (I know you are, we all love it!), there is no doubt that you’ll quickly master the Isolation Forest algorithm. So, how does it work?

One thing that should be clarified before explaining the algorithm is the concept of “instance isolation”. While most other algorithms try to model normal behaviour and learn its profile patterns, this algorithm tries to separate anomalous instances from the rest of the data (hence the term “isolation”). The easier it is to isolate an instance, the higher the chance that it is anomalous. As the authors of this paper note, most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to that profile as anomalies. What is the problem with that? Well, defining normal behaviour – and its boundaries. In most cases, it is too expensive and time-consuming to label the data and obtain information about normal and anomalous behaviour. And that is where creativity steps onto the stage, to create something simple, but borderline ingenious (okay, I’m a little bit biased here :D).

So, basically, Isolation Forest (iForest) works by building an ensemble of trees, called isolation trees (iTrees), for a given dataset. An iTree is built by recursive partitioning: at each node, a feature is randomly selected, and a split value is randomly chosen from that feature’s available range of values. Once the split value is chosen, the partitioning starts – each instance with a feature value lower than the split value is routed to one side of the tree, while each instance with a feature value greater than or equal to the split value is routed to the opposite side. The same random selection is then repeated on each side of the tree, recursively, until all instances are placed into terminal nodes (leaves), or one of the stopping criteria set by the input parameters is met. The simplified process of building a tree is depicted below.

[Figure: the simplified process of building an isolation tree]
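
To make the partitioning concrete, here is a minimal Python sketch of a single iTree (the names build_itree and path_length are mine, purely illustrative – real implementations handle the stopping criteria more carefully):

import numpy as np

def build_itree(X, depth=0, max_depth=10):
    # isolate instances by recursive random partitioning
    n = len(X)
    if n <= 1 or depth >= max_depth:
        return {"type": "leaf", "size": n}
    q = np.random.randint(X.shape[1])          # random feature
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                               # constant feature, cannot split
        return {"type": "leaf", "size": n}
    p = np.random.uniform(lo, hi)              # random split value
    return {"type": "node", "feature": q, "split": p,
            "left": build_itree(X[X[:, q] < p], depth + 1, max_depth),
            "right": build_itree(X[X[:, q] >= p], depth + 1, max_depth)}

def path_length(x, tree, depth=0):
    # number of splits needed to isolate instance x
    if tree["type"] == "leaf":
        return depth
    side = "left" if x[tree["feature"]] < tree["split"] else "right"
    return path_length(x, tree[side], depth + 1)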

The idea is simple – if an instance is easily isolated (meaning fewer steps were needed to place it into a terminal node), it has a higher chance of being an anomaly. In the illustration from the original paper, the instance Xi, sitting in a dense area, requires many steps to be isolated, while the lonely instance Xo requires far fewer.

So, in the iForest algorithm, the anomaly score is determined by the path length from the root node to the leaf in which the instance is placed. Since it is an ensemble, the path lengths for a given instance are averaged over all the trees. It follows that the average path length and the anomaly score are inversely related – the shorter the path, the higher the anomaly score. In case I was totally confusing, I will just leave the official documentation here.
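
Concretely, the paper defines the score as s(x, n) = 2^(−E[h(x)] / c(n)), where E[h(x)] is the average path length over the trees and c(n) is the average path length of an unsuccessful search in a binary search tree built on n instances, used as a normaliser. A small sketch of that formula:

import numpy as np

EULER_GAMMA = 0.5772156649  # Euler–Mascheroni constant

def c(n):
    # average path length of an unsuccessful BST search on n instances
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + EULER_GAMMA    # approximate harmonic number
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    # scores close to 1 indicate anomalies; scores well below 0.5, normality
    return 2.0 ** (-avg_path_length / c(n))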

There are two main parameters – the number of trees and the sub-sampling size. The sub-sampling size controls the size of the sample that will be used for training each tree, when the partitioning is done. In the original paper, the authors report that they empirically found the optimal values for these parameters to be 100 trees and a sub-sampling size of 256.

A hidden hero…

Now, there are two main problems that can be encountered when dealing with anomaly detection: swamping and masking. Swamping is the situation of wrongly identifying normal instances as anomalous, which can happen when normal and anomalous instances are close to each other. Masking is the situation of wrongly identifying anomalous instances as normal, which frequently happens when they are collectively situated in a dense area, so that they “conceal” their own presence. The sub-sampling in iForest allows it to build a partial model that is resistant to both effects. As the paper illustrates, sub-sampling can solve both problems easily: it clears out the normal instances surrounding anomalous clusters and reduces the size of the anomalous clusters themselves, so the remaining anomalies require fewer steps to isolate.

The scikit-learn implementation has some additional parameters, like the maximum number of features to be considered, and contamination – the expected proportion of anomalies in the data, which is quite handy. It basically works in such a way that the given percentage of instances with the highest anomaly scores is flagged as anomalous. It has a decision_function() which calculates the average anomaly score, plus the usual functions for fitting and predicting. And we haven’t gotten to the biggest forte of the algorithm yet! It can work both in supervised and unsupervised mode, which makes it a truly hidden hero of the machine learning realms. So, you can either feed it the input data and labels, or the input data alone, and it will crunch it and return the output.
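
A minimal scikit-learn sketch (the dummy data and the contamination value are mine, purely illustrative):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.normal(size=(1000, 3))                    # mostly normal points
X_test = np.vstack([rng.normal(size=(95, 3)),           # normal test points
                    rng.uniform(-6, 6, size=(5, 3))])   # a few outliers

# 100 trees and a sub-sampling size of 256, as suggested in the paper
clf = IsolationForest(n_estimators=100, max_samples=256,
                      contamination=0.05, random_state=42)
clf.fit(X_train)                         # unsupervised: no labels needed
scores = clf.decision_function(X_test)   # lower scores -> more anomalous
labels = clf.predict(X_test)             # -1 for anomalies, +1 for normal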

Introducing some fairness into this post…

In order to reduce bias, let’s talk about the drawbacks, since there are some. The first, and the most important one – it is not able to work with multivariate time series, which is one of the biggest problems we face in practice. Regarding the scikit-learn implementation, one thing that really bothers me is that it is not possible to get the particular decision of each iTree. Besides, it is also impossible to visualize the trees, and let’s be honest – people looove to see a picture. There is nothing they like more than a perfectly depicted algorithm they can interpret all by themselves. It would be really useful to see which feature caused an anomaly, and with what intensity. Another thing that concerns me is – how will it behave with features that are only slightly deviant? I’m afraid this can lead to similar path lengths, and to the problem of masking (correct me if I’m wrong). Finally, I haven’t really tested it with categorical data, but if someone has experience with that, please drop me a line.

By its nature, Isolation Forest shows enviable performance when working with high-dimensional or redundant data. It is pretty powerful when fitted in an appropriate way, for problems that don’t require contextual anomaly detection (like time series or spatial analysis). And as it is superfast, I really like to use it. I’ve tried it on retail and telco data, and the results were pretty satisfying.

Anomaly detection is still one of the biggest problems analysts encounter when working with data. There is no doubt that more algorithms for detecting anomalies will appear, with lots of improvements and new capabilities. But the most important thing is not to neglect the significance and impact of anomalies on results and decision making, because they represent a very sensitive problem that should be carefully approached and analysed. Keep that in mind. Cheers!

Time Series Anomaly Detection Using a Variational Autoencoder (VAE)

Why time series anomaly detection?


Let’s say you are tracking a large number of business-related or technical KPIs (that may have seasonality and noise). It is in your interest to automatically isolate a time window for a single KPI whose behavior deviates from normal behavior (a contextual anomaly – for the definition refer to this post). With the problematic time window at hand, you can further explore the values of that KPI, link the anomaly to an event which caused the unexpected behavior and, most importantly, act on that information.

To do the automatic time window isolation, we need a time series anomaly detection machine learning model. The goal of this post is to introduce a probabilistic neural network (the VAE) as a time series machine learning model and explore its use in the area of anomaly detection. While this post keeps the math to a minimum, it does require some neural network and probability knowledge.

Background


As Valentina mentioned in her post, there are three different approaches to anomaly detection using machine learning, based on the availability of labels:

  1. unsupervised anomaly detection
  2. semi-supervised anomaly detection
  3. supervised anomaly detection

Someone with knowledge of the domain needs to assign labels manually. Therefore, acquiring precise and extensive labels is a time-consuming and expensive process. I’ve deliberately put unsupervised as the first approach, since it doesn’t require labels. It does, however, require that normal instances outnumber the abnormal ones. Not only do we require an unsupervised model, we also require it to be good at modeling non-linearities.

What model? Enter neural networks…


Historically, different kinds of neural networks have had success with modeling complex non-linear data (e.g. image, sound and text data). However, universal function approximators that they are, they have inevitably found their way into modeling tabular data. One interesting type of tabular data modeling is time-series modeling.

A model that has made the transition from complex data to tabular data is the autoencoder (AE). An autoencoder consists of two parts – an encoder and a decoder. It tries to learn a smaller representation of its input (the encoder) and then reconstruct the input from that smaller representation (the decoder). The anomaly score is designed to correspond to the reconstruction error.
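
As a rough sketch of that scoring idea (encoder and decoder here are hypothetical callables standing in for trained networks):

import numpy as np

def reconstruction_error(x, encoder, decoder):
    # anomaly score: how badly the AE reconstructs the input window
    z = encoder(x)        # compress into the smaller representation
    x_hat = decoder(z)    # reconstruct the input from it
    return np.mean((x - x_hat) ** 2)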

The autoencoder has a probabilistic sibling, the variational autoencoder (VAE), a Bayesian neural network. It tries to reconstruct not the original input, but the parameters of a (chosen) distribution of the output. Here the anomaly score is designed to correspond to an anomaly probability. Choosing the distribution is a problem-dependent task, and it can also be a research path of its own. Now we delve into slightly more technical details.


Both AE and VAE use a sliding window of KPI values as an input. Model performance is mainly determined by the size of the sliding window.
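
A minimal sketch of how such windows could be built (the helper name sliding_windows and the window size below are mine, for illustration):

import numpy as np

def sliding_windows(series, window_size):
    # turn a 1-D KPI series into overlapping windows of shape
    # (n_windows, window_size), ready to be fed to the (V)AE
    series = np.asarray(series, dtype=np.float32)
    n = len(series) - window_size + 1
    return np.stack([series[i:i + window_size] for i in range(n)])

# e.g. 1000 KPI values and a window of 120 -> an array of shape (881, 120)
windows = sliding_windows(np.random.randn(1000), window_size=120)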

Diggin’ deeper into Variational Autoencoders…


The smaller representation in the VAE context is called a latent variable, and it has a prior distribution (chosen to be the Normal distribution). The encoder is its posterior distribution and the decoder is its likelihood distribution; both are Normal distributions in our problem. A forward pass would be (see the sketch right after this list):

  1. Encode an instance into the mean value and standard deviation of the latent variable
  2. Sample from the latent variable’s distribution
  3. Decode the sample into the mean value and standard deviation of the output variable
  4. Sample from the output variable’s distribution
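
A conceptual numpy sketch of those four steps (encoder and decoder are hypothetical functions returning a mean and a standard deviation; the actual TensorFlow model comes further below):

import numpy as np

def vae_forward(x, encoder, decoder):
    z_mean, z_std = encoder(x)                              # 1. encode
    z = z_mean + z_std * np.random.randn(*z_mean.shape)     # 2. sample latent
    x_mean, x_std = decoder(z)                              # 3. decode
    return x_mean + x_std * np.random.randn(*x_mean.shape)  # 4. sample output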

The variational autoencoder is a probabilistic neural network (also called a Bayesian neural network), and it is also a type of graphical model. An in-depth description of graphical models can be found in Chapter 8 of Christopher Bishop‘s Pattern Recognition and Machine Learning.

A TensorFlow (1.x) definition of the model:

import numpy as np
import tensorflow as tf
from timeit import default_timer as timer

tfd = tf.contrib.distributions  # distributions namespace used below

class VAE(object):
    def __init__(self, kpi, z_dim=None, n_dim=None, hidden_layer_sz=None):
        """
        Args:
          kpi : name of the KPI the model is trained for.
          z_dim : dimension of latent space.
          n_dim : dimension of input data (sliding window size).
          hidden_layer_sz : width of the dense hidden layers.
        """
        if not z_dim or not n_dim:
            raise ValueError("You should set z_dim (latent space dimension) "
                             "and n_dim (input dimension).")

        tf.reset_default_graph()
        
        def make_prior(code_size):
            # standard Normal prior over the latent variable
            loc = tf.zeros(code_size)
            scale = tf.ones(code_size)
            return tfd.MultivariateNormalDiag(loc, scale)

        self.z_dim = z_dim
        self.n_dim = n_dim
        self.kpi = kpi
        self.dense_size = hidden_layer_sz
        
        self.input = tf.placeholder(dtype=tf.float32,shape=[None, n_dim], name='KPI_data')
        self.batch_size = tf.placeholder(tf.int64, name="init_batch_size")

        # tf.data api
        dataset = tf.data.Dataset.from_tensor_slices(self.input).repeat() \
            .batch(self.batch_size)
        self.ite = dataset.make_initializable_iterator()
        self.x = self.ite.get_next()
        
        # Define the model: the encoder maps the input window to the
        # mean and scale of the posterior over the latent variable.
        self.prior = make_prior(code_size=self.z_dim)
        x = tf.contrib.layers.flatten(self.x)
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        loc = tf.layers.dense(x, self.z_dim)
        scale = tf.layers.dense(x, self.z_dim, tf.nn.softplus)
        self.posterior = tfd.MultivariateNormalDiag(loc, scale)
        self.code = self.posterior.sample()

        # Define the loss: the decoder maps a latent sample back to the
        # output distribution, and the negative ELBO is minimised.
        x = self.code
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        loc = tf.layers.dense(x, self.n_dim)
        scale = tf.layers.dense(x, self.n_dim, tf.nn.softplus)
        self.decoder = tfd.MultivariateNormalDiag(loc, scale)
        self.likelihood = self.decoder.log_prob(self.x)
        self.divergence = tf.contrib.distributions.kl_divergence(self.posterior, self.prior)
        self.elbo = tf.reduce_mean(self.likelihood - self.divergence)
        self._cost = -self.elbo
        
        self.saver = tf.train.Saver()
        self.sess = tf.Session()

    def fit(self, Xs, learning_rate=0.001, num_epochs=10, batch_sz=200, verbose=True):
        
        self.optimize = tf.train.AdamOptimizer(learning_rate).minimize(self._cost)

        # Xs has shape (n_windows, n_dim)
        batches_per_epoch = int(np.ceil(len(Xs) / batch_sz))
        print("\n")
        print("Training anomaly detector/dimensionality reduction VAE for KPI", self.kpi)
        print("\n")
        print("There are", batches_per_epoch, "batches per epoch")
        start = timer()
        
        self.sess.run(tf.global_variables_initializer())
        
        for epoch in range(num_epochs):
            train_error = 0

            # re-initialize the dataset iterator at the start of each epoch
            self.sess.run(
                self.ite.initializer,
                feed_dict={
                    self.input: Xs,
                    self.batch_size: batch_sz})

            for step in range(batches_per_epoch):
                _, loss = self.sess.run([self.optimize, self._cost])
                train_error += loss
            mean_loss = train_error / batches_per_epoch
            if verbose:
                print("Epoch {:^6} Loss {:0.5f}".format(epoch + 1, mean_loss))

            # np.nan never compares equal, so use np.isnan for the check
            if np.isnan(train_error):
                return False
        end = timer()
        print("\n")
        print("Training time {:0.2f} minutes".format((end - start) / (60)))
        return True
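
A usage sketch for the class above (the KPI name and hyperparameter values are purely illustrative; windows is an array of shape (n_windows, n_dim), e.g. built with the sliding_windows helper sketched earlier):

model = VAE(kpi="cpu_utilization", z_dim=4, n_dim=120, hidden_layer_sz=64)
model.fit(windows, learning_rate=0.001, num_epochs=10, batch_sz=200)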

Theory is great… What about the real world?


Using the model on one of the data sets from the Numenta Anomaly Benchmark (NAB):

[Plot: VAE anomaly detection results on a NAB data set]

In this case the model was able to achieve a true positive rate of 1.0 (TPR = 1.0) and a false positive rate of 0.07 (FPR = 0.07). For various anomaly probability thresholds we get a ROC curve:
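
With scikit-learn, the curve could be computed along these lines (y_true and y_prob below are dummy stand-ins for the real test labels and the model’s anomaly probabilities):

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 1, 0, 1])               # 1 marks a labelled anomaly
y_prob = np.array([0.1, 0.2, 0.15, 0.9, 0.3, 0.8])  # model anomaly probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
# each threshold corresponds to one (FPR, TPR) point on the curve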

Choosing the threshold read from the ROC curve plot, we get the following on the test set:

Just as the ROC curve suggested, the model was able to completely capture the abnormal behavior. Alas, like all neural network models, this beast needs its hyperparameters tuned. However, the only hyperparameter that can greatly affect performance is the size of the sliding window.

I hope I was successful in introducing this fairly complex model in simple terms. I encourage you to try the model on other data sets available from here.


Keep on learning and solving things!