Light reading for nerds: How to build a Visual Search application

In a previous blog post, we talked about Visual Search and its application in eCommerce. Now that we know what Visual Search is and where we can use it, it is time to learn what lies beneath it and how we can implement our own Visual Search app.

Visual Search is one application of Computer Vision. Many Computer Vision applications are implemented with Neural Networks, specifically with Convolutional Neural Networks (CNNs), so we will first go through some of the main concepts of CNNs.

Convolutional Neural Network

Central to the convolutional neural network is the convolutional layer that gives the network its name. This layer performs an operation called a convolution, which is actually a linear operation that involves the multiplication of a set of weights with the input. The multiplication (dot product) is performed between an array of input data and a two-dimensional array of weights, called a filter or a kernel.

Source: https://medium.datadriveninvestor.com/convolutional-neural-networks-3b241a5da51e

If the filter is designed to detect a specific type of feature in the input, then applying that filter systematically across the entire input image gives it the opportunity to discover that feature anywhere in the image.
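To make the operation concrete, here is a minimal NumPy sketch (our own illustration, not code from the original implementation) of a 3×3 filter sliding over a small grayscale image with stride 1 and no padding, computing a dot product at every position:

```python
import numpy as np

# A tiny 6x6 grayscale "image": bright on the left, dark on the right.
image = np.array([
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
    [10, 10, 10, 0, 0, 0],
], dtype=float)

# A 3x3 vertical-edge filter (kernel).
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

def convolve2d(img, k):
    kh, kw = k.shape
    out_h = img.shape[0] - kh + 1   # no padding, stride 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the kernel and the image patch it covers.
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

print(convolve2d(image, kernel))
```

The filter responds strongly (values of 30) only at positions where the window spans the transition from bright to dark, i.e. exactly at the vertical edge, no matter where in the image that edge appears.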

Because pixels at the corners and edges of the image contribute to the output much less than pixels in the middle, there is a concept called padding, which surrounds the image with an additional border of n pixels.

Image source: https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1

There is also a parameter called stride, which controls how many pixels the filter moves at each step. The default stride is (1, 1) for the height and width movement. A larger stride downsamples the input: the filter still covers the whole image, but the output is smaller and keeps less information.

Image source: https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1
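In a deep learning framework, padding and stride are simply arguments of the convolutional layer. A minimal Keras sketch (the input size and filter counts here are illustrative assumptions, not values from our application):

```python
from tensorflow.keras import Input, Model, layers

# A hypothetical 64x64 RGB input, just to show the effect of padding and stride.
inputs = Input(shape=(64, 64, 3))

# 'same' padding adds a zero border so the output keeps the 64x64 spatial size.
x = layers.Conv2D(filters=16, kernel_size=3, strides=(1, 1), padding="same")(inputs)

# A (2, 2) stride moves the filter two pixels at a time, halving height and width.
x = layers.Conv2D(filters=32, kernel_size=3, strides=(2, 2), padding="same")(x)

model = Model(inputs, x)
model.summary()  # feature maps go from (64, 64, 16) to (32, 32, 32)
```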

Besides the convolutional layer, CNNs contain two more types of layers (a small code sketch of both follows below):

  • Pooling layer – used to reduce the size of the representation, to speed up the computation, and to make some of the detected features more robust. There are two versions of this layer:
    • Max pooling (the most commonly used) – returns the maximum value from the portion of the image covered by the kernel
    • Average pooling (often found in very deep networks) – returns the average value from the portion of the image covered by the kernel
  • Fully connected layer – learns a possibly non-linear function of the extracted features, for example for classification

Image source: https://qph.fs.quoracdn.net/main-qimg-cf2833a40f946faf04163bc28517959c
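As a small illustration of how the three layer types fit together, here is a minimal Keras model (the input shape and the 10-class output are assumptions made for the sketch, not values from our application):

```python
from tensorflow.keras import layers, models

# A minimal CNN combining convolutional, pooling and fully connected layers.
model = models.Sequential([
    layers.Conv2D(16, kernel_size=3, activation="relu", input_shape=(64, 64, 3)),
    # Max pooling: keeps the maximum value in every 2x2 window, halving the size.
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    # Flatten the feature maps so the fully connected layers can consume them.
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    # Fully connected output layer, e.g. for a hypothetical 10-class problem.
    layers.Dense(10, activation="softmax"),
])
model.summary()
```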

CNN architectures

The most popular CNN architectures are:

  • LeNet – used for recognizing handwritten digits; not very deep
  • AlexNet – a deeper network, and the first of these to use ReLU as an activation function
  • VGGNet – simplified neural network architectures by stacking uniform blocks of small convolutions
  • GoogLeNet – uses techniques such as 1×1 convolutions and global average pooling, which enable a deeper architecture
  • ResNet – residual connections allow training of very deep neural networks
  • Inception – applies several convolution and pooling operations in parallel and concatenates their outputs

These architectures can be implemented from scratch, but there are open-source pre-trained models based on them that can be reused for building almost any computer vision application. This approach is called transfer learning.

Transfer Learning

To get the best results in Visual Search, you need to train your model on as large a dataset as you can provide. That training is computationally expensive, and it is not always possible to obtain the required amount of training data. One solution to this problem is Transfer Learning: a machine learning method where a model developed for one task is reused as the starting point for a model on a second task.

Image source: https://medium.com/@1154_75881/transfer-learning-628e83df5c8a

So, instead of creating and training a neural network from scratch, we can use a pre-trained model as the starting point for our training, which reduces training time and improves the performance of the model.
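As an illustration of the idea, here is a minimal Keras sketch that loads a pre-trained VGG16, freezes its convolutional layers and adds a small new head on top (the 5-class head and image size are hypothetical, not part of our Visual Search app):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load VGG16 with ImageNet weights, dropping the original classifier head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained convolutional layers

# Add a new head for a hypothetical task with 5 classes of our own.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Only the small new head is trained, which is why this works even with a modest dataset.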

Here, we have implemented Visual Search using the VGG16 pre-trained model.

VGG16 pre-trained model

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has been run annually since 2010. Researchers from the Oxford Visual Geometry Group, or VGG for short, participate in the ILSVRC challenge. The VGG model was trained on the ImageNet dataset, which contains over 15 million high-resolution labeled images collected from the Web, and it achieves a top-5 accuracy of 92.7% on ImageNet.

Image source: https://pub.towardsai.net/the-architecture-and-implementation-of-vgg-16-b050e5a5920b

VGG released two CNN models, a 16-layer model and a 19-layer model. Both use stacked blocks of convolutional and pooling layers, followed by three fully connected layers, where the third FC layer performs the classification. To use this architecture for Visual Search, we drop those fully connected layers and keep only the part of the model that extracts features from the image.

Our application has two parts: an offline part and a main part. In the offline part, we extract features for every image in the training set. In the main part, we apply Visual Search by measuring the similarity between the query image's features and the stored features. There are several ways to do that: Euclidean distance, Hamming distance, Cosine similarity, etc. For the first version, we used plain Euclidean distance, and after the user uploads an image, we return the top N most similar images.
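A sketch of this pipeline, assuming Keras is used for the pre-trained VGG16 (the catalog paths below are hypothetical placeholders):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

# Feature extractor: VGG16 without the fully connected layers.
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]          # a 512-dimensional feature vector

# Offline part: extract features for every image in the training set.
catalog_paths = ["shoes/img_001.jpg", "shoes/img_002.jpg"]   # hypothetical paths
catalog_features = np.array([extract_features(p) for p in catalog_paths])

# Main part: given a query image, return the top N most similar catalog images.
def search(query_path, top_n=5):
    q = extract_features(query_path)
    distances = np.linalg.norm(catalog_features - q, axis=1)  # Euclidean distance
    best = np.argsort(distances)[:top_n]
    return [catalog_paths[i] for i in best]
```

In practice the catalog features are computed once in the offline part and stored, so at search time only the query image has to pass through the network.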

For the visual presentation of this application, we used Streamlit, a modern open-source Python framework for creating and sharing custom web apps for machine learning and data science. You can read more about Streamlit in one of our future blog posts.
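To give a feel for how little code the UI needs, here is a minimal Streamlit sketch; the search function is the hypothetical retrieval routine from the sketch above, and the wiring of the uploaded file is simplified:

```python
import streamlit as st

st.title("Visual Search demo")

# Let the user upload a query image.
uploaded = st.file_uploader("Upload a photo of a shoe", type=["jpg", "jpeg", "png"])

if uploaded is not None:
    st.image(uploaded, caption="Your query", width=250)
    # `search` is the hypothetical retrieval function sketched earlier; in practice
    # the uploaded bytes would be saved or decoded before being passed to it.
    results = search(uploaded, top_n=5)
    st.subheader("Most similar products")
    for path in results:
        st.image(path, width=200)
```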

Examples of results

For training, we used a dataset of shoe images that can be found here. Below, you can see a few examples of using our Visual Search app:

Next steps

This was our first iteration of the Visual Search app. The plan for the future is to try some other algorithms and see whether they give us better results. We can also improve Transfer Learning by removing the last few layers and retraining the network to get features that are more relevant to our training set of images. In addition, we will try to integrate this with some of the most popular instant messaging applications, such as Viber, WhatsApp and Telegram, so that customers can use Visual Search outside the web application as well. Of course, there is also room for improvement in the design of the Streamlit web app.

If you have any questions regarding this topic or maybe suggestions for improvement, feel free to contact us! 🙂

 

Photo credits: https://unsplash.com/photos/7okkFhxrxNw

AI for eCommerce: How to use Visual Search

Lately we’ve been hearing all kinds of buzzwords: ‘deep learning’, ‘neural networks’, ‘computer vision’, ‘artificial intelligence’… Explaining each of them properly is not so simple, but it is not hard to grasp their main concepts well enough to use them in real-world applications. One real-world application that incorporates all of these concepts is Visual Search. In this blog post, we will explain what Visual Search is, and why and when we should use it.

What is Visual Search?

Visual Search is extremely interesting because of its applications in business across different industries. It enables searching with images instead of text. Humans are visual beings: 90% of the information transmitted to the human brain is visual. So, the ability to explore an online offer with just an image is something that can set retailers apart today, and in the not so distant future it could be a default feature of every eCommerce website.

Visual Search != Image Search

Both are connected with images, but the main difference is that in Image Search users type text to search for images, while in Visual Search they use an image itself as the query.

Why should we use it?

It seems that this is something that will be required of any eCommerce platform in the near future, especially for fashion and furniture retailers. Statistics show that 62% of millennials want visual search over any other new technology. And why is that? Because Visual Search makes searching for desirable products much simpler: users can simply take a photo of a desired item, use it to inform their search and get an immediate result of all the similar products currently available. It is perfect for shoppers who face two common dilemmas: “I don’t know what I want, but I’ll know it when I see it” and “I know what I want, but I don’t know what it’s called”, as people from Amazon say.

Use cases in eCommerce

Many online retailers are already using Visual Search as one of the features on their platforms. Even though they are not retailers, Google and Pinterest are leaders in this field. They have developed interesting and accurate Visual Search tools, and they are an inspiration to others trying to implement something similar.

Pinterest

On Pinterest, you can upload a picture of your outfit and get pictures of similar outfits using their Pinterest Lens tool.

They have also improved their Visual Search tool with the Shop the look feature: with just one tap, you can buy items from the picture. They went even further by adding Complete the look, which is essentially a recommender of complementary products. All of this combined could act as an online shopping assistant, which is undoubtedly useful for retailers with a wide range of products.

Google

Google and its Google Lens tool are game changers in the Visual Search field. Some of the main Google Lens features are:

Amazon

Another big player has developed a Visual Search tool called StyleSnap. With this tool, you can easily replicate the look from a picture you have uploaded, and it works in both the web app and the mobile app.

Amazon also has a collaboration with Snapchat: a tool that enables Snapchat users to take a photo of a product and buy it on Amazon.

Other retailers

Many other retailers have followed Pinterest, Google and Amazon with their own Visual Search solutions: eBay has an in-app feature for searching with an image; Forever 21 developed a feature called Discover Your Style, which significantly increased their conversion rates; and IKEA has a tool for searching for furniture with an image to make online shopping easier. Those are just a few of the most interesting examples, but even lesser-known retailers recognize Visual Search as one of the keys that can differentiate them from the competition.

If you want to know more about Visual Search, how it actually works and how it can be implemented, stay tuned, because we’ve got more of this coming! 🙂