Light reading for nerds: How to build a Visual Search application

In a previous blog post, we talked about Visual Search and its applications in eCommerce. Now that we know what Visual Search is and where it can be used, it is time to look at what lies beneath it and how we can implement our own Visual Search app.

Visual Search is one application of Computer Vision. A large number of the applications of Computer Vision are implemented with Neural Networks, specifically with Convolutional Neural Networks (CNNs). We will go through some main concepts of CNNs.

Convolutional Neural Network

Central to the convolutional neural network is the convolutional layer that gives the network its name. This layer performs an operation called a convolution, a linear operation that multiplies a set of weights with the input. The multiplication (a dot product) is performed between a patch of the input data and a two-dimensional array of weights, called a filter or a kernel.


If the filter is designed to detect a specific type of feature in the input, then the application of that filter systematically across the entire input image allows the filter an opportunity to discover that feature anywhere in the image.
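To make the sliding-filter idea concrete, here is a minimal sketch in plain Python, assuming a single-channel image, a stride of 1, and no padding. The `conv2d` function and the hand-written `vertical_edge` filter are illustrative names, not part of our application; in a real CNN the filter weights are learned rather than written by hand.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch it currently covers.
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# A 4x4 image with a vertical edge between a dark left half and a bright right half.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hypothetical filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is non-zero only in the column where the filter straddles the edge, which is what "discovering the feature anywhere in the image" means in practice.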

Because pixels in the corners and on the edges of the image contribute to far fewer outputs than pixels in the center, there is a concept called padding, which surrounds the image with an additional border of n pixels (usually zeros).
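As a small illustration of padding, the sketch below surrounds an image with a border of zeros. It assumes zero padding (the most common choice), and `zero_pad` is a hypothetical helper name used only for this example.

```python
def zero_pad(image, n):
    """Return a copy of `image` surrounded by a border of `n` zero pixels."""
    padded_width = len(image[0]) + 2 * n
    top = [[0] * padded_width for _ in range(n)]
    bottom = [[0] * padded_width for _ in range(n)]
    # Each original row gets n zeros on the left and n zeros on the right.
    middle = [[0] * n + list(row) + [0] * n for row in image]
    return top + middle + bottom


image = [
    [1, 2],
    [3, 4],
]

padded = zero_pad(image, 1)
for row in padded:
    print(row)
```

With a border of one pixel, the 2x2 image becomes 4x4, so a filter can now be centered on the original corner pixels as well.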


There is also a parameter called stride, which sets how many pixels the filter moves at each step. The default stride is (1, 1) for the height and width movement. A larger stride downsamples the input, producing a smaller output that keeps the main features while requiring less computation.
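The combined effect of filter size, padding, and stride on the output size can be computed with the standard formula floor((n + 2p - f) / s) + 1. The sketch below applies it along one dimension; `conv_output_size` is a hypothetical helper name, and the 224-pixel input is just an example value.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output length along one dimension for input size n and filter size f."""
    return (n + 2 * padding - f) // stride + 1


# A 3x3 filter with padding 1 and stride 1 keeps the size unchanged.
same_size = conv_output_size(224, 3, padding=1, stride=1)

# The same filter without padding and with stride 2 roughly halves the size.
downsampled = conv_output_size(224, 3, padding=0, stride=2)

print(same_size, downsampled)
```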


Besides the convolutional layer, CNNs have two more types of layers:

  • Pooling layer – used to reduce the size of the representation, speed up computation, and make the detected features a bit more robust. There are two versions of this layer:
    • Max pooling (the more commonly used) – returns the maximum value from the portion of the image covered by the kernel
    • Average pooling (sometimes used in very deep networks) – returns the average value from the portion of the image covered by the kernel
  • Fully connected layer – learns a possibly non-linear combination of the extracted features, for example, for classification
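The two pooling variants can be sketched as follows, assuming non-overlapping square windows (the window size equals the stride). `pool2d` and the sample values are illustrative only.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping size x size windows using the maximum or the average."""
    if mode == "max":
        reduce = max
    else:
        reduce = lambda values: sum(values) / len(values)
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            # Collect every value covered by the current window.
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 4],
    [0, 2, 9, 1],
    [2, 0, 3, 3],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)
print(avg_pooled)
```

Both variants shrink the 4x4 input to 2x2; max pooling keeps the strongest response in each window, while average pooling smooths it.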


CNN architectures

The most popular CNN architectures are:

  • LeNet – an early, relatively shallow network used for recognizing handwritten digits
  • AlexNet – a deeper network that popularized ReLU as an activation function
  • VGGNet – simplified the architecture by stacking small 3×3 convolutions
  • GoogLeNet – uses techniques such as 1×1 convolution and global average pooling to build a deeper architecture
  • ResNet – residual connections allow training of very deep neural networks
  • Inception – applies several convolutional and pooling layers in parallel and concatenates their outputs

These architectures can be implemented from scratch, but there are also open-source pre-trained models built on them that can be used for building computer vision applications. This approach is called transfer learning.

Transfer Learning

To get the best results in visual search, you need to train your model on as large a dataset as you can provide. That training is computationally expensive, and it's not always possible to obtain the required amount of training data. One solution to this problem is transfer learning: a machine learning method where a model developed for one task is reused as the starting point for a model on a second task.


So, instead of creating and training neural networks from scratch, we can use pre-trained models as a starting point, which reduces training time and can improve the performance of the model.

Here, we have implemented Visual Search using the VGG16 pre-trained model.

VGG16 pre-trained model

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was run annually starting in 2010. Researchers from the Oxford Visual Geometry Group, or VGG for short, participated in the challenge. The VGG model was trained on images from ImageNet, a dataset of over 15 million labeled high-resolution images collected from the Web (the challenge itself uses a subset of roughly 1.2 million training images in 1,000 categories). It achieves a top-5 accuracy of 92.7% on the ImageNet test set.


VGG released two different CNN models: a 16-layer model and a 19-layer model. Both use stacked blocks of convolutional and pooling layers, followed by 3 fully connected (FC) layers, the last of which performs the classification. To use this architecture for Visual Search, we can drop the fully connected layers and keep only the part of the model that extracts features from the image.

Our application has two parts: an offline part and a main part. In the offline part, we extract features for all images in our dataset. In the main part, we perform the Visual Search by measuring the similarity between the features of the uploaded image and the stored features. There are several ways of doing that: Euclidean distance, Hamming distance, cosine similarity, etc. For the first version, we tried plain Euclidean distance: after an image is uploaded, we return the top N most similar images.
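The search step can be sketched in a few lines of plain Python, assuming the offline part has already produced one feature vector per image. The file names, the 3-dimensional vectors, and the helper names below are hypothetical and chosen only for readability; real VGG16 features have far more dimensions, and this is not necessarily how our application stores them.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_n_similar(query, index, n=3):
    """Return the names of the n images whose features are closest to `query`."""
    ranked = sorted(
        index.items(),
        key=lambda item: euclidean_distance(query, item[1]),
    )
    return [name for name, _ in ranked[:n]]


# Hypothetical output of the offline part: image name -> feature vector.
index = {
    "sneaker_01.jpg": [0.9, 0.1, 0.0],
    "sneaker_02.jpg": [0.8, 0.2, 0.1],
    "boot_01.jpg": [0.1, 0.9, 0.3],
    "sandal_01.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical features extracted from the uploaded image.
query_features = [0.9, 0.1, 0.05]

results = top_n_similar(query_features, index, n=2)
print(results)
```

Sorting every stored vector is fine for a small catalog. For a large one, an approximate nearest-neighbor index would be the usual replacement, but that is outside the scope of this first version.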

For the visual presentation of this application, we have used Streamlit, a modern open-source Python framework used for creating and sharing custom web apps for machine learning and data science. In some of our future blog posts, you can read more about Streamlit.

Examples of results

We used a dataset of shoe images, which can be found here. Below are a few examples of our Visual Search app in use:

Next steps

This was our first iteration of the Visual Search app. The plan for the future is to try other algorithms and see if they give better results. We can also improve the transfer learning step by removing the last few layers and retraining the network to get features that are more relevant to our set of images. In addition, we will try to integrate the app with some of the most popular instant messaging applications, such as Viber, WhatsApp and Telegram, so that customers can use Visual Search outside the web application. Of course, there is also room for improvement in the design of the Streamlit web app.

If you have any questions regarding this topic or maybe suggestions for improvement, feel free to contact us! 🙂

