Light reading for nerds: How to build a Visual Search application

In one of our previous blog posts, we talked about Visual Search and its application in eCommerce. Now that we know what Visual Search is and where we can use it, it is time to learn what lies beneath it and how we can implement our own Visual Search app.

Visual Search is one application of Computer Vision. Many Computer Vision applications are implemented with Neural Networks, specifically with Convolutional Neural Networks (CNNs). We will go through the main concepts of CNNs.

Convolutional Neural Network

Central to the convolutional neural network is the convolutional layer, which gives the network its name. This layer performs an operation called a convolution: a linear operation that multiplies a set of weights with the input. The multiplication (an element-wise product followed by a sum, i.e. a dot product) is performed between a patch of the input data and a two-dimensional array of weights, called a filter or a kernel.


If the filter is designed to detect a specific type of feature in the input, then the application of that filter systematically across the entire input image allows the filter an opportunity to discover that feature anywhere in the image.
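To make the sliding dot product concrete, here is a minimal sketch in plain Python, assuming no padding and a stride of 1. The function name conv2d, the tiny image and the hand-made vertical-edge filter are all hypothetical and chosen only for illustration; a real CNN learns its filter weights during training.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the image patch under it.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k_h)
                for b in range(k_w)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-made vertical-edge filter (illustration only).
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The filter responds only in the middle column, which is exactly where the dark-to-bright edge sits, no matter which row we look at.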

Because pixels at the corners and edges of the image contribute to far fewer output values than pixels in the middle, there is a concept named padding, which surrounds the image with an additional border of n pixels (usually zeros).


There is also something called stride, which is the number of pixels the filter moves at each step. The default stride is (1,1) for height and width movement. A larger stride downsamples the input, so we get a smaller output that still summarizes the same image at a lower computational cost.
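The combined effect of padding and stride on the output size follows the standard formula floor((n + 2p - f) / s) + 1 per dimension, where n is the input size, f the filter size, p the padding and s the stride. A small sketch (the function name conv_output_size is ours, not part of any library):

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output length along one dimension: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1


print(conv_output_size(6, 3))             # 4: a 3x3 filter shrinks a 6x6 image
print(conv_output_size(6, 3, padding=1))  # 6: one pixel of padding keeps the size
print(conv_output_size(7, 3, stride=2))   # 3: a stride of 2 downsamples
```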


Besides the convolutional layer, CNNs have two more types of layers:

  • Pooling layer – used to reduce the size of the representation, to speed up the computation, and to make the detected features a bit more robust. There are two versions of this layer:
    • Max pooling (the most commonly used) – returns the maximum value from the portion of the image covered by the kernel
    • Average pooling (sometimes used in very deep networks) – returns the average value from the portion of the image covered by the kernel
  • Fully connected layer – learns a possibly non-linear function of the extracted features, for example, for classification
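As a rough illustration of the two pooling variants, the sketch below applies non-overlapping 2×2 pooling to a small, made-up feature map. The helper pool2d and its mode argument are hypothetical names used only for this example.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: the window moves by its own size."""
    reduce = max if mode == "max" else (lambda values: sum(values) / len(values))
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            # Collect the values covered by the pooling window.
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


# Made-up 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 6],
    [9, 1, 0, 2],
    [3, 5, 8, 4],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)  # [[7, 6], [9, 8]]
print(avg_pooled)  # [[4.0, 3.0], [4.5, 3.5]]
```

Both variants reduce the 4×4 map to 2×2; max pooling keeps the strongest response in each window, while average pooling smooths it.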


CNN architectures

The most popular CNN architectures are:

  • LeNet – a relatively shallow network that was used for recognizing handwritten digits
  • AlexNet – a deeper network that popularized ReLU as an activation function
  • VGGNet – simplified the architecture by stacking small 3×3 convolutions
  • GoogLeNet – uses methods such as 1×1 convolution and global average pooling, which enable a deeper architecture
  • ResNet – residual connections allow training of very deep neural networks
  • Inception – applies several convolutional and pooling operations in parallel and concatenates their outputs

These architectures can be implemented from scratch, but there are open-source pre-trained models that use these architectures and can be used for building a computer vision application. This approach is called transfer learning.

Transfer Learning

To get the best results in visual search, you need to train your model on as large a dataset as you can provide. That training is computationally expensive, and it is not always possible to obtain the required amount of training data. One solution to this problem is transfer learning, a machine learning method where a model developed for one task is reused as the starting point for a model on a second task.


So, instead of creating and training neural networks from scratch, we can use pre-trained models as a starting point, which reduces the training time and can improve the performance of the model.

Here, we have implemented Visual Search using the VGG16 pre-trained model.

VGG16 pre-trained model

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was run annually starting in 2010. Researchers from the Oxford Visual Geometry Group, or VGG for short, participated in the ILSVRC challenge. The VGG model was trained on data from ImageNet, a dataset of over 15 million labeled high-resolution images collected from the Web. It achieves 92.7% top-5 accuracy on the ImageNet test set.


VGG released two different CNN models: a 16-layer model and a 19-layer model. Both use stacked blocks of convolutional and pooling layers, followed by 3 fully connected (FC) layers at the end, where the third FC layer is used for classification. To use this architecture for Visual Search, we can drop those fully connected layers and keep only the part of the model that extracts features from the image.

Our application has two parts: an offline part and a main part. In the offline part, we extract features for all images in the training set. In the main part, we apply Visual Search by measuring the similarity between the features of the uploaded image and the stored features. There are several ways to do that: Euclidean distance, Hamming distance, cosine similarity, etc. For the first version, we used plain Euclidean distance: after an image is uploaded, we return the top N most similar images.
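The search step in the main part can be sketched as follows. This is not our production code: it assumes the features were already extracted in the offline part and stored per image, and the file names, the three-dimensional feature vectors and the helper top_n_similar are all made up for the example (real VGG16 features have far more dimensions).

```python
import math


def top_n_similar(query, catalog, n=3):
    """Return the names of the n catalog images closest to `query` (Euclidean distance)."""
    ranked = sorted(catalog, key=lambda name: math.dist(query, catalog[name]))
    return ranked[:n]


# Hypothetical pre-computed features (offline part).
catalog = {
    "sneaker_01.jpg": [0.9, 0.1, 0.0],
    "sneaker_02.jpg": [0.8, 0.2, 0.1],
    "boot_01.jpg": [0.1, 0.9, 0.3],
    "sandal_01.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical features of the uploaded image (main part).
query_features = [0.9, 0.1, 0.05]

results = top_n_similar(query_features, catalog, n=2)
print(results)  # ['sneaker_01.jpg', 'sneaker_02.jpg']
```

A full sort is fine for a small catalog; for a large one, an approximate nearest neighbor index would be the usual next step.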

For the visual presentation of this application, we have used Streamlit, a modern open-source Python framework used for creating and sharing custom web apps for machine learning and data science. In some of our future blog posts, you can read more about Streamlit.

Examples of results

We used a dataset of shoe images for training, which can be found here. Below are a few examples of using our Visual Search app:

Next steps

This was our first iteration of the Visual Search app. The plan for the future is to try other algorithms and see if they give better results. We can also improve the transfer learning step by removing the last few layers and retraining the whole network to get features that are more relevant to our training set of images. In addition, we will try to integrate the app with some of the most popular instant messaging applications, like Viber, WhatsApp and Telegram, so that customers can use Visual Search outside the web application as well. Of course, there is also room for improvement in the design of the Streamlit web app.

If you have any questions regarding this topic or maybe suggestions for improvement, feel free to contact us! 🙂



How to model better recommendations in Covid time?

Needs in the IT sector are constantly changing. Since the coronavirus hit us unexpectedly, this is truer than ever. In order to keep distance from each other, we were forced to limit all activities that can’t be done online. At first, this was shocking and we weren’t prepared for such a change. But once we got used to it, we realized that online business isn’t the future, it is NOW.

Users now expect reliable eCommerce platforms with a satisfying user experience. Given the great number of users, products and competitors, this is really hard to achieve. So, during quarantine, we tried to make the most of the time and come up with something that would be valuable to our clients and improve both their eBusiness and core business. We decided to focus on creating next generation recommender systems.

In our other blog posts, you can find a detailed explanation of recommender systems. In short, recommender systems on e-commerce websites suggest new items to customers based on preferences inferred from the analysis of people’s behavior. Recommender systems can bring numerous benefits to companies that are using them:

  1. users can find desirable products much faster
  2. users can get matching products during cart check out process
  3. prevent users from abandoning shopping carts
  4. trigger emails based on online interactions
  5. much better user experience, especially with personalized recommendations or even whole pages


The next generation recommender systems

The e-commerce industry is going to expand on a larger scale, and so are recommendation engines. The next generation recommender systems are expected to include the following features:

  • more personalized recommendations – recommender systems will become more capable of digging deep into customer data, which will help them present more relevant, customer-centric recommendations.
  • reaching customers through multiple channels – recommender systems will be more capable of reaching users across various mediums like emails, social media channels, on-site and off-site shopping widgets, mobile apps, etc.
  • real time recommendations – recommender systems based on deep learning can react to user behavior in real time. They aim to present the right items to a user at the time when they are most useful to her.

Knowing all of this, we looked for an architecture for a new recommender engine that can work with online data in real time and can gain insights about users for better and more personalized recommendations. That “unicorn” is the HRNN model.

Session-based recommender with HRNN (Hierarchical Recurrent Neural Networks)

Why a session-based recommender? In many online systems where recommendations are applied, interactions between a user and the system are organized into sessions, and each session has a goal: to find some product or service. If the model is aware of the user’s intent in a given session, recommendations can use that information and become more relevant. But user history is also important: if two users had different interests in previous sessions, they should get different recommendations in their current sessions, even if the current sessions are the same. To achieve that, we are using the HRNN model.


In one of our next blog posts, we will write about HRNN in detail, but in general, the idea is that we apply this algorithm when user identifiers are present and propagate information from the previous user session to the next, thus improving the recommendation accuracy.


In the picture, we can see two layers of neural networks. The upper one is the session-level representation, and its memory cells keep information about just one session. That information (its output) is the input to the lower network, which is the user-level representation. The memory cells of the lower network keep information about all sessions of one user and propagate it to every next session. This architecture covers the cases where we have a user identifier; if we don’t, it comes down to a plain session-based recommender that takes into account only the current session data. As a result, we get the top N recommendations for every user session, and that’s exactly what we wanted in the first place.

If you are interested in how this was implemented in a real use case and what the results were, follow our blog; we will be happy to share our experience in future blog posts. You can contact us for a deeper explanation, or if you have thoughts to share on improving and developing next generation recommender engines, we would be more than happy to hear them. 🙂




Time series Anomaly Detection using a Variational Autoencoder (VAE)

Why time series anomaly detection?


Let’s say you are tracking a large number of business-related or technical KPIs (that may have seasonality and noise). It is in your interest to automatically isolate a time window for a single KPI whose behavior deviates from normal behavior (contextual anomaly – for the definition refer to this post). When you have the problematic time window at hand you can further explore the values of that KPI. You can then link the anomaly to an event which caused the unexpected behavior. Most importantly, you can then act on the information.

To do the automatic time window isolation, we need a time series anomaly detection machine learning model. The goal of this post is to introduce a probabilistic neural network (VAE) as a time series machine learning model and explore its use in the area of anomaly detection. Although this post tries to reduce the math as much as possible, it does require some neural network and probability knowledge.



As Valentina mentioned in her post, there are three different approaches to anomaly detection using machine learning, based on the availability of labels:

  1. unsupervised anomaly detection
  2. semi-supervised anomaly detection
  3. supervised anomaly detection

Someone who has knowledge of the domain needs to assign labels manually. Therefore, acquiring precise and extensive labels is a time-consuming and expensive process. I’ve deliberately put unsupervised as the first approach, since it doesn’t require labels. It does, however, require that normal instances outnumber the abnormal ones. Not only do we require an unsupervised model, we also require it to be good at modeling non-linearities.

What model? Enter neural networks…


Historically, different kinds of neural networks have had success with modeling complex non-linear data (e.g. image, sound and text data). However, being the universal function approximators that they are, they have inevitably found their way into modeling tabular data. One interesting type of tabular data modeling is time series modeling.

A model that has made the transition from complex data to tabular data is the Autoencoder (AE). An Autoencoder consists of two parts: an encoder and a decoder. It tries to learn a smaller representation of its input (encoder) and then reconstruct its input from that smaller representation (decoder). The anomaly score is designed to correspond to the reconstruction error.

The Autoencoder has a probabilistic sibling, the Variational Autoencoder (VAE), a Bayesian neural network. It tries to reconstruct not the original input itself, but the parameters of a (chosen) distribution of the output. The anomaly score is designed to correspond to an anomaly probability. Choosing a distribution is a problem-dependent task, and it can also be a research path. Now we delve into slightly more technical details.


Both AE and VAE use a sliding window of KPI values as an input. Model performance is mainly determined by the size of the sliding window.
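To illustrate what the model input looks like, here is a minimal sketch that turns a KPI series into sliding windows. The helper name sliding_windows and the sample values are hypothetical; each returned window would be one input instance of length n_dim.

```python
def sliding_windows(series, window_size):
    """Return every contiguous window of `window_size` consecutive values."""
    return [
        series[start:start + window_size]
        for start in range(len(series) - window_size + 1)
    ]


# Made-up KPI values with one obvious spike.
kpi_values = [10, 12, 11, 13, 50, 12]

windows = sliding_windows(kpi_values, window_size=3)
print(windows)
# [[10, 12, 11], [12, 11, 13], [11, 13, 50], [13, 50, 12]]
```

A larger window gives the model more context per instance but produces fewer instances and delays detection, which is one reason the window size matters so much.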

Diggin’ deeper into Variational Autoencoders…



The smaller representation in the VAE context is called a latent variable, and it has a prior distribution (chosen to be the Normal distribution). The encoder models its posterior distribution and the decoder models its likelihood distribution. Both of them are Normal distributions in our problem. A forward pass would be:

  1. Encode an instance into a mean value and standard deviation of the latent variable
  2. Sample from the latent variable’s distribution
  3. Decode the sample into a mean value and standard deviation of the output variable
  4. Sample from the output variable’s distribution
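The four steps above can be sketched in plain Python. This is a toy illustration under loud assumptions: encode and decode are hypothetical stand-ins for the trained encoder and decoder networks, and the input window is made up. The log-density and KL divergence formulas are the standard closed forms for Normal distributions with diagonal covariance; their difference is a single-sample estimate of the evidence lower bound (ELBO), the quantity a VAE is typically trained to maximize.

```python
import math
import random


def sample_diag_normal(mean, std, rng):
    """Draw one sample from a Normal distribution with diagonal covariance."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def log_prob_diag_normal(x, mean, std):
    """Log-density of x under a diagonal Normal distribution."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((v - m) / s) ** 2
        for v, m, s in zip(x, mean, std)
    )


def kl_to_standard_normal(mean, std):
    """KL divergence from Normal(mean, std) to the standard Normal prior."""
    return sum(0.5 * (s ** 2 + m ** 2 - 1.0) - math.log(s) for m, s in zip(mean, std))


def encode(x):
    # Hypothetical stand-in for the encoder network: (mean, std) of the latent variable.
    return [sum(x) / len(x)], [0.5]


def decode(z, n_dim):
    # Hypothetical stand-in for the decoder network: (mean, std) of the output variable.
    return [z[0]] * n_dim, [1.0] * n_dim


rng = random.Random(0)
x = [0.2, -0.1, 0.4]                                # one made-up sliding window

z_mean, z_std = encode(x)                           # step 1
z = sample_diag_normal(z_mean, z_std, rng)          # step 2
x_mean, x_std = decode(z, n_dim=len(x))             # step 3
x_sample = sample_diag_normal(x_mean, x_std, rng)   # step 4

# Single-sample ELBO estimate: reconstruction log-likelihood minus KL to the prior.
elbo = log_prob_diag_normal(x, x_mean, x_std) - kl_to_standard_normal(z_mean, z_std)
print(elbo)
```

A window that the decoder explains poorly gets a low log-likelihood, which is what makes this kind of model usable for anomaly scoring.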

The Variational Autoencoder is a probabilistic neural network (also called a Bayesian neural network). It is also a type of graphical model. An in-depth description of graphical models can be found in Chapter 8 of Christopher Bishop’s Pattern Recognition and Machine Learning.

A TensorFlow definition of the model:

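import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (placeholders, sessions, tf.contrib)
from timeit import default_timer as timer

# Assumed alias: the original snippet uses `tfd` without showing where it comes from.
tfd = tf.contrib.distributions


class VAE(object):
    def __init__(self, kpi, z_dim=None, n_dim=None, hidden_layer_sz=None):
        """
        Args:
          kpi : name of the KPI (used in log messages).
          z_dim : dimension of latent space.
          n_dim : dimension of input data.
          hidden_layer_sz : number of units in each hidden layer.
        """
        if not z_dim or not n_dim:
            raise ValueError("You should set z_dim (latent space dimension) "
                             "and your input n_dim.")

        def make_prior(code_size):
            # Standard Normal prior over the latent variable.
            loc = tf.zeros(code_size)
            scale = tf.ones(code_size)
            return tfd.MultivariateNormalDiag(loc, scale)

        self.z_dim = z_dim
        self.n_dim = n_dim
        self.kpi = kpi
        self.dense_size = hidden_layer_sz
        self.input = tf.placeholder(dtype=tf.float32, shape=[None, n_dim], name='KPI_data')
        self.batch_size = tf.placeholder(tf.int64, name="init_batch_size")

        # tf.data API. The right-hand side of this assignment is missing in the
        # original snippet; the pipeline below is an assumed reconstruction.
        dataset = tf.data.Dataset.from_tensor_slices(self.input) \
            .repeat().batch(self.batch_size)
        self.ite = dataset.make_initializable_iterator()
        self.x = self.ite.get_next()

        # Encoder: maps a window to the parameters of the posterior over the latent variable.
        self.prior = make_prior(code_size=self.z_dim)
        x = tf.contrib.layers.flatten(self.x)
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        loc = tf.layers.dense(x, self.z_dim)
        scale = tf.layers.dense(x, self.z_dim, tf.nn.softplus)
        self.posterior = tfd.MultivariateNormalDiag(loc, scale)
        self.code = self.posterior.sample()

        # Decoder: maps a latent sample to the parameters of the output distribution.
        x = self.code
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        loc = tf.layers.dense(x, self.n_dim)
        scale = tf.layers.dense(x, self.n_dim, tf.nn.softplus)
        self.decoder = tfd.MultivariateNormalDiag(loc, scale)

        # Loss: negative evidence lower bound (ELBO).
        self.likelihood = self.decoder.log_prob(self.x)
        self.divergence = tfd.kl_divergence(self.posterior, self.prior)
        self.elbo = tf.reduce_mean(self.likelihood - self.divergence)
        self._cost = -self.elbo
        self.saver = tf.train.Saver()
        self.sess = tf.Session()

    def fit(self, Xs, learning_rate=0.001, num_epochs=10, batch_sz=200, verbose=True):
        self.optimize = tf.train.AdamOptimizer(learning_rate).minimize(self._cost)
        # Assumed step (not shown in the original): initialize all variables before training.
        self.sess.run(tf.global_variables_initializer())

        # Xs has shape [n_windows, n_dim], so len(Xs) is the number of windows.
        # (The original snippet used len(Xs[0]), which is n_dim for this shape.)
        batches_per_epoch = int(np.ceil(len(Xs) / batch_sz))
        print("Training anomaly detector/dimensionality reduction VAE for KPI", self.kpi)
        print("There are", batches_per_epoch, "batches per epoch")
        start = timer()
        for epoch in range(num_epochs):
            train_error = 0

            # (Re)initialize the iterator with the training data for this epoch.
            # The call itself is missing in the original; sess.run is assumed.
            self.sess.run(
                self.ite.initializer,
                feed_dict={
                    self.input: Xs,
                    self.batch_size: batch_sz})

            for step in range(batches_per_epoch):
                _, loss = self.sess.run([self.optimize, self._cost])
                train_error += loss
            mean_loss = train_error / batches_per_epoch
            if verbose:
                print("Epoch {:^6} Loss {:0.5f}".format(epoch + 1, mean_loss))
            # NaN never compares equal to itself, so use np.isnan instead of ==.
            if np.isnan(train_error):
                return False
        end = timer()
        print("Training time {:0.2f} minutes".format((end - start) / 60))
        return True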

Theory is great… What about real world?


Using the model on one of the data sets from the Numenta Anomaly Benchmark (NAB):


In this case the model achieved a true positive rate of TPR = 1.0 and a false positive rate of FPR = 0.07. For various anomaly probability thresholds we get a ROC curve:
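For readers who want to see how a point on the ROC curve is obtained, here is a minimal sketch. The scores and labels below are made up and are not the NAB results; tpr_fpr is a hypothetical helper that flags a window as anomalous when its anomaly probability reaches the threshold.

```python
def tpr_fpr(scores, labels, threshold):
    """Flag a window as anomalous when its score is >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


# Made-up anomaly probabilities and ground-truth labels (1 = anomaly).
scores = [0.05, 0.10, 0.20, 0.60, 0.90, 0.95, 0.15, 0.70]
labels = [0, 0, 0, 0, 1, 1, 0, 0]

# One (threshold, TPR, FPR) point per threshold; sweeping thresholds traces the ROC curve.
roc_points = [(t, *tpr_fpr(scores, labels, t)) for t in (0.5, 0.8)]
for threshold, tpr, fpr in roc_points:
    print("threshold={:.1f} TPR={:.2f} FPR={:.2f}".format(threshold, tpr, fpr))
```

In this toy data, raising the threshold from 0.5 to 0.8 keeps every true anomaly while removing the false alarms, which is the kind of trade-off the ROC curve makes visible.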

Choosing the threshold read from the ROC curve plot, we get the following on the test set:

Just as the ROC curve suggested, the model was able to completely capture the abnormal behavior. Alas, as all neural network models are in need of hyperparameter tuning, this beast is no exception. However, the only hyperparameter that can greatly affect the performance is the size of the sliding window.

I hope I was successful in introducing this fairly complex model in simple terms. I encourage you to try the model on other data sets available from here.


Keep on learning and solving things!