Introduction to recommender systems

After taking the Udemy online course Building Recommender Systems with Machine Learning and AI, I decided to write a text that helps beginners understand the basic ideas behind recommender systems.

A recommender system, or recommendation system, is a subclass of information filtering systems that seeks to predict the “rating” or “preference” a user would give to an item.

In the last decade companies have invested a lot of money in their development. Netflix awarded a $1 million prize to a developer team in 2009 for an algorithm that increased the accuracy of the company’s recommendation engine by 10 percent.

There are two main types of recommender systems – personalized and non-personalized.

Picture 1 – Types of recommender systems

 

Non-personalized recommendation systems, such as popularity-based recommenders, recommend the most popular items to all users, for instance the top 10 movies, the top-selling books, or the most frequently purchased products.
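
A popularity-based recommender can be sketched in a few lines. The purchase log and the function name top_n_popular below are made up for illustration; the only idea taken from the text is "count purchases and return the most frequent items".

```python
from collections import Counter

# Hypothetical purchase log: (user, item) pairs, made up for illustration.
purchases = [
    ("jenny", "Aristoi"),
    ("tom", "Aristoi"),
    ("robert", "Fahrenheit 451"),
    ("john", "Aristoi"),
    ("jenny", "Fahrenheit 451"),
    ("tom", "The Time Machine"),
]


def top_n_popular(purchase_log, n=10):
    """Return the n most frequently purchased items, most popular first."""
    counts = Counter(item for _user, item in purchase_log)
    return [item for item, _count in counts.most_common(n)]


top_items = top_n_popular(purchases, n=2)
print(top_items)
```

Every user gets the same list here, which is exactly what makes the approach non-personalized.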

What is a good recommendation?

  • The one that is personalized (relevant to that user)
  • The one that is diverse (includes different user interests)
  • The one that doesn’t recommend the same items to users for the second time
  • The one that recommends available products

Personalized

A personalized recommender system analyzes user data, purchases, ratings and relationships with other users in more detail. That way, every user gets customized recommendations.

The most popular types of personalized recommendation systems are content based and collaborative filtering.

Content based

Content based recommender systems use item or user metadata to create specific recommendations. The user’s purchase history is observed. For example, if a user has already read a book by a certain author or bought a product of a certain brand, it is assumed that the customer has a preference for that author or brand, and there is a good chance that the user will buy a similar product in the future. Assume that Jenny loves sci-fi books and her favourite writer is Walter Jon Williams. If she reads Aristoi, then her recommended book will be Angel Station, another sci-fi book written by Walter Jon Williams.

Picture 2 – Content based recommender system
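
The Jenny example can be sketched as a comparison of item metadata. The small catalogue, the similarity measure (share of matching metadata fields) and the function names are assumptions chosen for illustration, not a prescribed method.

```python
# Hypothetical catalogue: each book is described by metadata only.
books = {
    "Aristoi": {"author": "Walter Jon Williams", "genre": "sci-fi"},
    "Angel Station": {"author": "Walter Jon Williams", "genre": "sci-fi"},
    "The Time Machine": {"author": "H. G. Wells", "genre": "sci-fi"},
    "Emma": {"author": "Jane Austen", "genre": "romance"},
}


def metadata_similarity(a, b):
    """Fraction of shared metadata fields on which two items agree (0.0 to 1.0)."""
    fields = a.keys() & b.keys()
    if not fields:
        return 0.0
    return sum(a[f] == b[f] for f in fields) / len(fields)


def recommend_similar(read_title, catalogue, n=2):
    """Rank the other items by metadata similarity to the item just read."""
    read = catalogue[read_title]
    scored = [
        (metadata_similarity(read, meta), title)
        for title, meta in catalogue.items()
        if title != read_title
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _score, title in scored[:n]]


recommendations = recommend_similar("Aristoi", books)
print(recommendations)
```

Because only metadata is compared, no ratings from other users are needed, which is also why the results tend to stay close to what the user already knows.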

In practice, collaborative filtering gives better results than the content based approach. Perhaps this is because content based results are not as diverse as the results of collaborative filtering.

Disadvantages of content based approach:

  • There is the so-called filter bubble phenomenon. If a user reads a book about a political ideology and only books related to that ideology are recommended to him, he stays in the “bubble of his previous interests”.
  • A lot of data about the user and his preferences needs to be collected to get the best recommendations.
  • In practice, about 20% of items attract the attention of 70-80% of users, and 70-80% of items attract the attention of 20% of users. The recommender’s goal is to surface products that users would not notice at first glance. The content based approach does not achieve this goal as well as collaborative filtering does.

Collaborative filtering

The idea of collaborative filtering is simple: the behavior of a group of users is used to make recommendations to other users. Since the recommendation is based on the preferences of other users, it is called collaborative.

There are two types of collaborative filtering: memory-based and model based.

Memory based

Memory based techniques are applied to raw data without preprocessing. They are easy to implement and the resulting recommendations are generally easy to explain. However, every prediction has to be computed over all the data, which slows down the recommender.

There are two types: user based and item based collaborative filtering.

  • User based – “Users who are similar to you also liked…” Products are recommended to the user because they were purchased or liked by users who are similar to the observed user. What does it mean that users are similar? For example, Jenny and Tom love sci-fi books. When a new sci-fi book appears and Jenny buys it, we can recommend that book to Tom, since he also likes sci-fi books.
Picture 3 – User based collaborative filtering recommender system

 

  • Item based – “Users who liked this item also liked…” If John, Robert and Jenny rated the sci-fi books Fahrenheit 451 and The Time Machine highly, for example with 5 stars, then when Tom buys Fahrenheit 451, The Time Machine is also recommended to him, because the system identified the two books as similar based on user ratings.
Picture 4 – Item based collaborative filtering recommender system

 

How to calculate user-user and item-item similarities?

Unlike the content based approach, where metadata about users or items is used, the memory based collaborative filtering approach observes user behavior, e.g. whether the user liked or rated an item, or whether the item was liked or rated by a certain user.

For example, the idea is to recommend a new sci-fi book to Robert.

Steps:

  1. Create user-item-rating matrix
  2. Create user-user similarity matrix 
    • Cosine similarity is calculated between every pair of users (alternatives: adjusted cosine similarity, Pearson similarity, Spearman rank correlation). In this way a user-user matrix is obtained. This matrix is smaller than the initial user-item-rating matrix when there are fewer users than items.
  3. Look up similar users
    • In the user-user matrix, users that are most similar to Robert are observed
  4. Candidate generation
    • When Robert’s most similar users are found, then we look at all the books these users read and ratings they gave.
  5. Candidate scoring
    • Depending on the ratings, books are ranked from the ones that Robert’s most similar users liked the most, to the ones they liked the least.
    • The results are normalized (on a scale from 0 to 1)
  6. Candidate filtering
    • Check whether Robert has already bought any of these books. Those books should be eliminated because he has already read them.

The calculation of item-item similarity is done in an identical way and has all the same steps as user-user similarity.
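
The six steps can be sketched for a toy dataset as follows. The ratings are invented, and two details are assumptions of this sketch rather than something the steps prescribe: candidates are scored by a similarity-weighted sum of the neighbours' ratings, and already-read books are filtered out before the scores are normalized.

```python
import math

# Step 1: hypothetical user-item-rating matrix; missing entries mean "not rated".
ratings = {
    "Robert": {"Fahrenheit 451": 5, "The Time Machine": 4},
    "Jenny": {"Fahrenheit 451": 5, "The Time Machine": 5, "Aristoi": 5},
    "Tom": {"Fahrenheit 451": 4, "Aristoi": 4, "Angel Station": 3},
    "John": {"Emma": 5, "The Time Machine": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse rating vectors (dicts item -> rating)."""
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, ratings, k=2, n=10):
    # Steps 2-3: similarity of `user` to every other user, keep the k closest.
    sims = {
        other: cosine_similarity(ratings[user], ratings[other])
        for other in ratings
        if other != user
    }
    neighbours = sorted(sims, key=sims.get, reverse=True)[:k]

    # Steps 4-5: collect the neighbours' items, score by similarity-weighted rating.
    scores = {}
    for other in neighbours:
        for item, rating in ratings[other].items():
            scores[item] = scores.get(item, 0.0) + sims[other] * rating

    # Step 6: drop items the user has already rated.
    scores = {i: s for i, s in scores.items() if i not in ratings[user]}

    # Normalize to the 0-1 range by dividing by the best remaining score.
    top = max(scores.values(), default=0.0)
    if top > 0:
        scores = {i: s / top for i, s in scores.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


robert_recs = recommend("Robert", ratings)
print(robert_recs)
```

For the item-based variant the same code can be reused on the transposed data, with one rating vector per book instead of one per user.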

Comparison of user-based and item-based approaches

The similarity between items is more stable than the similarity between users: a math book will always be a math book, but a user can change his mind, e.g. something he liked last week he might not like next week. Another advantage is that there are usually fewer products than users, so an item-item similarity matrix will be smaller than a user-user matrix. The item-based approach also works better when a new user visits the site, which is problematic for the user-based approach.

Model based

These models are developed using machine learning algorithms. A model is built, and recommendations are made from the model rather than from all the data, which speeds up the system. This approach achieves better scalability. Dimensionality reduction is often used in this approach. The best-known technique of this kind is matrix factorization.

Matrix factorization

If there is feedback from the user, for example a user has watched a particular movie or read a particular book and has given a rating, it can be represented as a matrix where each row represents a particular user and each column represents a particular item. Since it is almost impossible for a user to rate every item, this matrix will have many missing values. This is called sparsity. Matrix factorization methods are used to find a set of latent factors and to determine user preferences using these factors. Latent information can be uncovered by analyzing user behavior. The latent factors are also called features.

Why factorization?

The rating matrix can be approximated by a product of two smaller matrices: the item-feature matrix and the user-feature matrix.

Picture 5 – Matrix factorization

Matrix factorization steps:

  1. Initialize the user and item matrices with random values
  2. The predicted ratings matrix is obtained by multiplying the user matrix and the transposed item matrix
  3. The goal of matrix factorization is to minimize the loss function (the difference between the predicted and the actual ratings must be minimal). Each rating can be described as a dot product of a row in the user matrix and a column in the transposed item matrix.
Picture 6 – minimization of loss function

Here K is the set of (u,i) pairs with a known rating, r(u,i) is the rating of item i by user u, and λ is the regularization parameter (used to avoid overfitting).
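
For readers without the picture, a regularized squared-error loss of this kind is commonly written as below. The symbols p_u (the row of the user matrix for user u) and q_i (the row of the item matrix for item i) are notation assumed for this sketch, so the exact form in the picture may differ.

```latex
\min_{p,\,q} \sum_{(u,i) \in K} \left( r(u,i) - p_u^{\top} q_i \right)^2
  + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)
```

The first term penalizes the prediction error on known ratings, and the second term keeps the factor values small.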

  4. To minimize the loss function we can apply Stochastic Gradient Descent (SGD) or Alternating Least Squares (ALS). Both methods can be used to incrementally update the model as new ratings come in. SGD is often reported to be faster and more accurate than ALS, although this depends on the data and the implementation.
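
The steps above can be sketched with plain SGD on a toy set of ratings. Everything concrete here is an assumption for illustration: the ratings, the two latent factors, the learning rate, the regularization value and the number of epochs. Real systems would rely on a tuned library implementation.

```python
import random

# Hypothetical observed ratings: (user index, item index, rating).
observed = [
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 5.0), (1, 1, 5.0), (1, 2, 5.0),
    (2, 0, 4.0), (2, 2, 4.0), (2, 3, 3.0),
    (3, 1, 1.0), (3, 3, 5.0),
]
n_users, n_items, n_factors = 4, 4, 2


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def regularized_loss(P, Q, data, lam):
    """Squared error on known ratings plus an L2 penalty on the factors used."""
    return sum(
        (r - dot(P[u], Q[i])) ** 2 + lam * (dot(P[u], P[u]) + dot(Q[i], Q[i]))
        for u, i, r in data
    )


def factorize(data, lr=0.02, lam=0.05, epochs=300, seed=0):
    rng = random.Random(seed)
    # Step 1: random user (P) and item (Q) factor matrices.
    P = [[rng.uniform(0.0, 0.5) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(n_factors)] for _ in range(n_items)]
    initial_loss = regularized_loss(P, Q, data, lam)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            # Steps 2-3: predict one rating as a dot product and measure the error.
            err = r - dot(P[u], Q[i])
            for f in range(n_factors):
                p_uf, q_if = P[u][f], Q[i][f]
                # Step 4: SGD update for squared error with L2 regularization.
                P[u][f] += lr * (err * q_if - lam * p_uf)
                Q[i][f] += lr * (err * p_uf - lam * q_if)
    return P, Q, initial_loss, regularized_loss(P, Q, data, lam)


P, Q, loss_before, loss_after = factorize(observed)
print(f"loss before: {loss_before:.2f}, after: {loss_after:.2f}")
print(f"predicted rating of user 0 for unseen item 2: {dot(P[0], Q[2]):.2f}")
```

Once P and Q are learned, a missing rating is predicted by the dot product of the corresponding user and item rows, which is how the sparse matrix gets filled in.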

Hybrid recommenders

They represent a combination of different recommenders. The assumption is that a combination of several different recommenders will give better results than a single algorithm.

Recommender systems metrics

Which metric to use depends on the business problem being solved. If we think that we have made the best possible recommender and the metric is great, but in practice it performs badly, then our recommender is not good. The prize-winning Netflix recommender was never used in practice because it did not meet customer needs. The most important thing is that the user gains confidence in the recommender system. If we recommend the top 10 products to a user and only 2 or 3 are relevant to him, he will consider the recommender system bad. For this reason, the idea is not to always recommend the top 10 items, but to recommend only items above a certain threshold.

Metrics:

  • Accuracy (MAE, RMSE)
  • Top-N recommender measures:
    • Hit rate – first find all items in the user’s history in the training data; remove one of these items (leave-one-out cross-validation); use all the other items to feed the recommender and produce the top 10 recommendations. If the removed item appears in the top 10 recommendations, it is a hit. If not, it is not a hit.
    • Average reciprocal hit rate (ARHR) – we get more credit for a hit near the top of the ranked list than for a hit near the bottom.
    • Cumulative hit rate – hits on items whose rating is below a certain threshold are rejected, e.g. ratings less than 4.
    • Rating hit rate – hits are broken down by rating value in order to find which rating values get more hits. Sum the number of hits for each rating value in the top-N list and divide by the total number of items with that rating in the top-N list.
  • Online A/B testing – A/B testing is the best way to do online evaluation of your recommender system.
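
Hit rate and ARHR from the list above can be sketched as follows. The top-N lists and the left-out items are invented, and the sketch assumes the recommender has already produced the lists from leave-one-out training data.

```python
def hit_rate(top_n_by_user, left_out_by_user):
    """Share of users whose left-out item appears in their top-N list."""
    hits = sum(
        left_out_by_user[user] in top_n
        for user, top_n in top_n_by_user.items()
    )
    return hits / len(top_n_by_user)


def average_reciprocal_hit_rate(top_n_by_user, left_out_by_user):
    """Like hit rate, but a hit at rank k only counts as 1/k."""
    total = 0.0
    for user, top_n in top_n_by_user.items():
        if left_out_by_user[user] in top_n:
            rank = top_n.index(left_out_by_user[user]) + 1  # ranks start at 1
            total += 1.0 / rank
    return total / len(top_n_by_user)


# Hypothetical top-3 lists and the single item held out for each user.
top_n_by_user = {
    "Jenny": ["Aristoi", "Angel Station", "Emma"],
    "Tom": ["Emma", "The Time Machine", "Aristoi"],
    "Robert": ["Emma", "Angel Station", "Aristoi"],
    "John": ["Aristoi", "Angel Station", "The Time Machine"],
}
left_out_by_user = {
    "Jenny": "Aristoi",           # hit at rank 1
    "Tom": "The Time Machine",    # hit at rank 2
    "Robert": "Fahrenheit 451",   # miss
    "John": "Emma",               # miss
}

hr = hit_rate(top_n_by_user, left_out_by_user)
arhr = average_reciprocal_hit_rate(top_n_by_user, left_out_by_user)
print(hr, arhr)
```

Two of the four users get a hit, so the hit rate is 0.5, while ARHR is lower because Tom's hit sits at rank 2 and only counts as one half.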

 

Recommender real world challenges

  • Cold start problem – a new user has appeared,  what to recommend?
    • For example top 10 best selling products
    • Top 10 products on promotion
    • The user can be interviewed to find out what he likes
  • A new product has appeared, how can it be recognized by the recommender?
    • Use content-based attributes
    • Randomly add new products to user recommendations
    • Promote new products
  • Churn 
    • Since users change their behavior over time, a certain dose of randomization should be part of a recommender system in order to refresh the top-N list of recommended items
  • Be careful not to offend anyone with the recommender
  • Be careful not to make discrimination of any kind
  • Avoid recommending items that contain vulgar words, religious and political topics or drugs

If you have any questions regarding this topic, or want to share some impressions – drop us an email. We will be happy to discuss this on a more detailed level! 🙂

Friday talks: EDA done right

Main challenges 

Although EDA is often seen as an initial step that should be straightforward, there are some challenges that can slow the process down and make it poor and painful. Some of the challenges I have encountered so far are listed below.

Poorly defined business problem (and not having the understanding of it). Not having a clear problem to solve can make you wander around without a specific goal, which can be positive and productive, but in most cases you will feel lost and won’t know what to do with all the data you have in your hands. On the other hand, if you don’t understand the main issues the business is facing, you will have trouble extracting helpful insights, since you will be looking in the wrong direction.

Not having the right data (nor talking to the right person). Even when the problem is defined and well understood, not identifying the right datasets, or not having the chance to talk to the person who knows the data in detail, can make the EDA a hell of a ride. Neither you nor the client will benefit from or be satisfied with the EDA results – and that is not what you want to get out of this process. Make sure you have the right data and the right “go-to” person for every question related to domain clarification, data gathering and merging, etc.

Messy data and (no) warehouse (causing a defensive attitude of the “go-to” person). In most cases the data will be messy: foreign key mismatches, no IDs to join the information from multiple sources on, wrong calculations, etc. Sometimes, when you try to merge some datasets and find differences in IDs, duplicates or something else, and you go to the person in charge of data maintenance, that person may go rogue. They focus on explaining the reasons for the mismatch and the mess, and not on giving directions on how to make things right – or even doing it. Be clear about what you want to do: you want to clean up your data (and get help to do that, if needed) in order to present how data science could help in leveraging some process, not to point out the messiness and the negligence of the people in charge of data maintenance.
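
One way to keep that conversation factual is to bring a short, neutral report of the join problems instead of a general complaint. The sketch below is only an illustration: the two tables, the customer_id key and the function join_key_report are hypothetical.

```python
from collections import Counter

# Hypothetical extracts from two source systems that should join on customer_id.
customers = [
    {"customer_id": "C1", "city": "Belgrade"},
    {"customer_id": "C2", "city": "Subotica"},
    {"customer_id": "C2", "city": "Novi Sad"},   # duplicated ID
]
transactions = [
    {"customer_id": "C1", "amount": 20.0},
    {"customer_id": "C3", "amount": 35.5},        # no matching customer
]


def join_key_report(left, right, key):
    """Summarize duplicate and unmatched keys before two tables are merged."""
    left_counts = Counter(row[key] for row in left)
    left_keys = set(left_counts)
    right_keys = {row[key] for row in right}
    return {
        "duplicate_keys_in_left": sorted(k for k, c in left_counts.items() if c > 1),
        "only_in_left": sorted(left_keys - right_keys),
        "only_in_right": sorted(right_keys - left_keys),
    }


report = join_key_report(customers, transactions, "customer_id")
print(report)
```

A report like this points at concrete records to fix, which tends to be easier to discuss than the mess as a whole.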

EDA done on auto-pilot (reports being containers, not treasuries of insights). Sometimes the problem is that EDA is found boring and oversimplified. It is done just to follow some defined flow, in order to say that you have done it, and then jump straight into sophisticated and complex ML algorithms. Most of the problems can be solved in the early stages of EDA – it is not easy, but if done right, you’re halfway there. Next time you’re doing EDA, rethink your approach in order to identify whether you are skipping steps and doing it with half a brain just because you find ML more interesting (which IMO is unacceptable: thorough EDA and data understanding are prerequisites for applying ML).

Not having a big picture. Remember what the main purpose of EDA is, and the goals you want to achieve through it. Not knowing why you do something will suppress your creativity, innovation and critical thinking. This results in extracting one-time insights. EDA for its own sake is allowed, but it is more applicable and useful if you do it in order to facilitate future analysis and the steps that will be taken.

How-to EDA

In order to make this process understandable, I have tried to present the main steps and guidelines in the image that follows (from the client-vendor perspective, but an analogous approach could be applied otherwise).

Business problem definition

If you want your EDA to make sense and have a purpose, start with the problem. In this step, the most important thing is to listen to what the client is saying. It often happens that they know which data is useful, but don’t have the expertise to utilize it. On the other hand, maybe they have already tried to perform the analysis and solve the problem using data manually, and your job is to help them speed up the process. In some cases, it may even happen that they have never joined information from different departments and don’t have the overview. Many different scenarios could happen, and that is why it is important for you to listen and not make any assumptions in advance. Translating into an analytics problem means understanding if and how data analysis can help in solving the issue. Defining the main pillars of analysis stands for identifying the perspectives of analysis that could be applied: what are the main entities/business areas that could be analyzed, and how are they connected. The main output of this step is the problem that should be better understood and finally solved.

Lessons learned: don’t make assumptions in advance, and let the client communicate the biggest issues.

Data sources identification

Sometimes there can be hundreds of input sources coming from various systems and placed in different locations. The goal of this step is to identify which sources contain the data that best describes the problem you want to model and solve. Not all sources are (equally) important. IMO, it is better to start small, with a representative input dataset assembled from a couple of different sources for a tailored analysis, than to have a vast amount of (uninvestigated) data without knowing where exactly to start. Having big data can be good, since it can describe different areas of the business, but at the same time it can be your worst enemy if you don’t have a focus or don’t know how to filter the information needed at a given time.

Lessons learned: don’t start with tens or hundreds of tables not knowing how to join them, or filter relevant information.

Set EDA baseline

Okay, to set one thing straight – doing EDA just to be compliant with some methodology sucks. EDA is the main prerequisite for a fruitful and successful analysis, based on data, statistics and machine learning. Doing EDA without purpose or clearly defined goals will make it painful, useless, and overwhelming. There are crucial points to be defined as a baseline for doing EDA:

  • defining a business problem (e.g. high churn rate, or poor CAPEX planning)
  • defining purpose of the analysis (e.g. getting familiar with the data, main relationships among data, understand predictive power and quality for future analysis)
  • defining goals of the analysis (e.g. extract insights from the data describing the most affected pillar within the business, possible directions of improvements etc.)
  • defining working infrastructure (e.g. sometimes initial  dataset has millions of records, which requires working environment ensuring that data manipulation is possible and does not take a lifetime)
  • defining stakeholders – main people that should be involved (go-to person(s) for data, and key people that could gain insights and benefits from the analysis)

Lessons learned: make sure you have all the prerequisites satisfied – business problem, purpose and goals, working infrastructure and stakeholders

Perform EDA

Be creative and utilize everything you picked up from the first step – business problem definition. Think about everything you have learned so far, from your own experience. Use analogy – although there are different businesses with their own functioning mechanisms, it often happens that some analysis you have performed in one use case, can be applied to another.

There are two main purposes of exploratory data analysis:

  • getting to know the data, understanding the business through the data and gaining an impression of how this data can be put to use
  • present insights that should either confirm or refute the current business performance belief and reflect a story on how this data can be used as a baseline of creating a sophisticated solution that would leverage the operational and strategic processes

In order to do that, one has to understand that although, for example, correlations and visualizations are a must-have and a helpful tool, they are not meant to be analyzed by the clients. You are creating the analysis for yourself, but in order to tell a story (to the client) based on that analysis. A report is not only a container of tables and graphs, but a utility that guides the reader and tells a story which reveals insights, irregularities and directions for improvement, characterizing the use case (business problem) being defined. So next time you create an EDA report, ask yourself: what is the value of this report? It is useless if you don’t have a basic understanding of how and why you have done it.

Lessons learned: create a story that will guide the reader/listener through the analysis, from the problem  setup, to the methodology and finally – insights. 

Present EDA Insights

This is your moment to shine. When you present EDA insights, you have to make a point: why the analysis is useful, what new learnings were obtained, and how they can be used for future analysis and modeling. In most cases, some things that are weird or unexpected to you are completely regular to the clients, since they know much more about their business. And sometimes the opposite happens. The idea is to use EDA as a guideline for defining the next actions and realizing the use case. Collect feedback on the analysis and insights presented, because sometimes enrichments, further data cleaning and modifications should be introduced.

Lessons learned: make a point (or multiple ones) with your analysis, and collect feedback on the analysis you have performed. 

The real deal – reference materials

I have found this comprehensive list of automated EDA libraries, and have used some of them myself (all time favs: Pandas-profiling, Sweetviz, and Yellowbrick). Additional links can be found in the following list:

  1. One from my early days: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
  2. Priceless list of questions to ask prior and within EDA: https://levelup.gitconnected.com/cozy-up-with-your-data-6aedfb651172
  3. Feel the power of  Sweetviz: https://towardsdatascience.com/powerful-eda-exploratory-data-analysis-in-just-two-lines-of-code-using-sweetviz-6c943d32f34
  4. Not completely EDA, but related to: https://medium.com/data-science-community-srm/machine-learning-visualizations-with-yellowbrick-3c533955b1b3
  5. If you want to see  how others do that: https://towardsdatascience.com/creating-python-functions-for-exploratory-data-analysis-and-data-cleaning-2c462961bd71

Tell me about your experience. I would like to hear your best practices and how you overcame frequently faced challenges.

Cheers! 🙂

Cover photo taken from: https://unsplash.com/@clarktibbs

FRIDAY TALKS: FRIENDS OR FOES? Propensity to purchase vs. Survival analysis

Retail industry. In the glory of Data Science, it’s all about the data and tailor-made targeting. If you want to brag about it, you would say – I’ve got the unique, omni-channel, 360-something, that can perfectly model customer’s behaviour and even go to Mars. What really happens is that you literally feel lost. There are plenty of different models, lots of useful and noisy data, challenges regarding resources, competition, expenditures,… How do you handle them all? Well, that’s not the topic of this post, but will certainly be one in future posts.

Modeling customer’s behaviour is a tricky task. Customers’ habits and preferences may change over time, and you always get stuck between earning more money and retaining a given customer base. And you have to keep in mind that you’re in a competitive market, so if your product is pure shit, too expensive, or poorly communicated – you’re doomed. Thus – it’s not only the customer, but the total business process, including several departments – and everything has to be coherent and harmonized in order to achieve the best results. Pretty ambitious, right? 😀

Whenever you have ambitious plans, a step-by-step realization is a good way to go. “Think big, act small”, they say. That is what Things Solver aims for. We always divide a problem into smaller problems. Modeling customer’s behaviour is a complex task, and should be handled carefully, from different (let’s say 360, just to keep up 😀 ) angles and through several phases.

To frame it, if you set a goal, for example – to improve customer experience – you know where to start from. The customer is ready to pay for something, if that something is going to meet his expectations and fulfill his needs. How to come up with that something? Data science is there to help.

There are several key components that you need to analyze:

  • characteristics (customer information and demographics),
  • activity level (purchase frequency, recency and trends),
  • habits (regarding market basket, time, spending,… ),
  • preferences (regarding manufacturers, stores, materials,…).

This is a very demanding and comprehensive analysis. It includes various modules, from lead generation, scoring and segmentation, through customer lifetime value estimation, propensity modeling and survival analysis, to the market basket analysis and recommender systems. If you know how to consolidate their outputs, you may say you’ve found the holy grail of tailor-made targeting. Since each of these modules is a separate area of research and development, there is enough material to dedicate a whole blog post to each one of them.

What I will talk about is a small fragment of a puzzle, a very attractive problem in customer behaviour modeling, that deals with the customer’s propensity and activity level. I will talk about two different approaches that are often misinterpreted as independent fields of analysis. Those are: propensity to purchase and survival analysis.

 

Propensity to purchase

 

Machine learning often relies on the assumption that there are hidden patterns in behaviour that can be identified and that are insightful. Identifying those patterns in customer behaviour can be pretty valuable for campaign management and focused targeting. The probability that a customer will make a purchase in some future period of time is called the customer’s propensity to purchase.

Why is this important? Well, each customer has some tendencies towards purchasing. If we can measure the probability that a customer will make a purchase, we can shape our campaign accordingly. Targeting customers that will certainly come anyway can result in unnecessary costs. Not targeting the sleepy customers will result in attrition and even higher costs.

Propensity to purchase analysis includes analyzing customer, transactional and internal data. This is important, because we want to understand all the circumstances and components that affect customer’s decision to come to the store (or visit a site). It is important to analyze customer’s data, if it is available, because teenagers and married couples may have different patterns of behaviour. If available, it is also important to track the behaviour online, and to join it with offline behaviour to get a complete picture.

On the other hand, transactions are a treasury of information. They can show what the habits and preferences are, without a single word from a customer. Internal data can provide additional information about availability, supplies, actions and discounts.

In this analysis, there are plenty of features that should be included. These features should reflect customer habits and activity level. It is important to analyze frequency, recency, amount of money spent, but it is also important to include interpurchase time and purchase trend. Speaking of retail industry, there are also seasonalities, like holidays and seasonal sales.
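
The activity features named above can be derived directly from a transaction log. The sketch below is a minimal illustration with invented transactions, an assumed snapshot date and hypothetical feature names; purchase trend and seasonality are left out to keep it short.

```python
from datetime import date
from statistics import mean

# Hypothetical transaction log: (customer, purchase date, amount spent).
transactions = [
    ("C1", date(2021, 1, 5), 40.0),
    ("C1", date(2021, 1, 19), 25.0),
    ("C1", date(2021, 2, 2), 35.0),
    ("C2", date(2021, 1, 10), 120.0),
]
snapshot = date(2021, 3, 1)  # assumed "today" of the analysis


def activity_features(rows, snapshot_date):
    """Per-customer frequency, recency, monetary value and interpurchase time."""
    by_customer = {}
    for customer, day, amount in rows:
        by_customer.setdefault(customer, []).append((day, amount))

    features = {}
    for customer, purchases in by_customer.items():
        purchases.sort()
        days = [day for day, _ in purchases]
        gaps = [(b - a).days for a, b in zip(days, days[1:])]
        features[customer] = {
            "frequency": len(purchases),
            "recency_days": (snapshot_date - days[-1]).days,
            "monetary": sum(amount for _, amount in purchases),
            # Undefined for customers with a single purchase.
            "avg_interpurchase_days": mean(gaps) if gaps else None,
        }
    return features


features = activity_features(transactions, snapshot)
print(features["C1"])
```

Customer C2 shows one of the analytical challenges in miniature: with a single purchase there is no interpurchase time to compute, so the feature has to be left empty or handled separately.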

At Things Solver, we are closely working on this comprehensive analysis, trying to solve all the challenges encountered during the whole process. And there are lots of them. Some are technical, like how do you/can you identify a customer (this is important if you want to gather any demographic information), while some are analytical, like should you analyze all customers at once, or how much history to include, etc…

Modeling customer’s propensity to make a purchase is a very demanding task, but it can be pretty helpful. Still, there are some drawbacks: when is the purchase going to happen? Or, if this customer is not going to make a purchase in the period observed, is he going to buy in some future period? And that is when survival analysis comes on stage.

 

Survival analysis

 

The survival analysis, as many other approaches, finds its roots in medical research and biological studies, and it represents the time-to-event modeling. You get the idea, right? The survival analysis is dealing with estimating the time period from an action to a given event. Magnificent!

Although pretty intuitive and sharp, survival analysis is a complex task. And it is more powerful than you can imagine. The main goal of this analysis is to estimate the time to a given event, and to quantitatively explain how this time depends on various properties of the treatment, customers and other variables. What is the event? Well, in our case it’s a purchase. What is the treatment? The promotion we’re targeting the customer with.

Why is this analysis so adorable and powerful? Because it solves some of the main drawbacks we encountered in the propensity to purchase modeling. First, the propensity to purchase, as said, only gives the probability of making a purchase, but we want to know when it will be. Second, in the propensity to purchase modeling, there are lots of unknown or missing outcomes (the customer hasn’t made another purchase by the time we’re observing the data, which does not mean he won’t do it in the future), which can be a problem when dealing with classification tasks. The records (customers) that we don’t know the outcome for are called censored records, and the survival analysis successfully deals with them.

At the core of this analysis are two functions: the survival function and the hazard function.

The survival function is defined as the probability that an individual “survives” from the time origin (the time of some trigger event) to time t. The value of the survival function at time point t corresponds to the fraction of customers who have not yet experienced the event at that point. While the survival function focuses on the probability of the event not happening, and thus on the survival time, the hazard function describes the “risk” of the event, which is more convenient for our case.

The hazard function is defined as the probability of the event in an infinitesimally small time period between t and t+dt, divided by the length of that period, given that the individual has survived up until time t. In other words, it is the instantaneous rate at which the event happens at time t among those who have not experienced it yet. If I have confused you, I strongly recommend consulting Google; it will give you plenty of thorough explanations of those two terms.
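
In symbols, the two definitions are usually written as below. T denotes the random time to the event (the purchase in our case); this notation is an assumption of the sketch and is not taken from the text.

```latex
S(t) = P(T > t)
\qquad
h(t) = \lim_{dt \to 0} \frac{P(t \le T < t + dt \mid T \ge t)}{dt}
\qquad
S(t) = \exp\left( -\int_0^t h(s)\, ds \right)
```

The last identity shows that the two functions carry the same information: a high hazard over some period means the survival curve drops quickly over that period.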

The second advantage of this analysis is that it can model the time to an event for the different groups that we want to target. Some advanced techniques and extended parametric and nonparametric models can estimate the time to an event given a set of features, like demographic and behavioral properties or targeting information. It can also be focused on an individual. If you want to learn more, take a look at this Python library: https://lifelines.readthedocs.io/en/latest/index.html.
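
To make the handling of censored records concrete, here is a minimal hand-written sketch of the Kaplan-Meier estimator, a standard nonparametric estimate of the survival function. It does not use the lifelines library mentioned above, and the durations and the function name kaplan_meier are invented for illustration.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival function.

    durations: time until purchase, or until the end of observation.
    observed:  True if the purchase happened, False if the record is censored.
    Returns a list of (time, estimated survival probability) pairs.
    """
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival, curve = 1.0, []
    for t in event_times:
        # Customers still under observation and without a purchase just before t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, seen in zip(durations, observed) if seen and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical days from a promotion to the next purchase for six customers.
durations = [5, 8, 8, 12, 15, 20]
observed = [True, True, False, True, False, True]

curve = kaplan_meier(durations, observed)
for day, prob in curve:
    print(f"day {day}: {prob:.3f} of customers have not purchased yet")
```

Censored customers (observed is False) never count as events, but they stay in the at-risk group until their observation ends, which is how the method uses their partial information instead of discarding it.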

 

1+1=3

 

My co-worker Strahinja always stresses that the key to finding the best solution is to try the hybrid approach. Once we had a lecture at the university, and a big heading was written on the presentation slide. “1+1=3”. It was inspired by the team work and united force. That is what we want to obtain. Let’s combine the best of both approaches, and create a higher value (pretty similar to the Master algorithm story, right? 🙂 ).

How can these approaches help us in campaign management optimization? If we have the probability that a customer will purchase (propensity to purchase) and the estimated time to that event (survival analysis), we can easily plan the timing and frequency of targeting. We don’t want to swamp our customers with thousands of promotions they are not interested in, or that arrive in a period when they simply are not in the habit of purchasing. Other modules like segmentation, recommender systems and CLV can help us in tailor-made campaign creation.

Notice that campaign management can only succeed if these modules are comprehensively developed. In a series of blog posts, my colleagues and I will try to explain these modules in more detail, so stay tuned! Cheers! 🙂

FRIDAY TALKS: WOMEN IN DATA SCIENCE

Hello, fellas. Cool down, I’m not going to talk about extreme feminism and gender (in)equality. 🙂 This post is going to be about extraordinary women I had a chance to meet at the Women in Data Science conference, held in Subotica, this April. I truly believe that these girls deserve to be heard of, as well as many others, so this is only the beginning. 🙂 As I mentioned before, the list of inspiring people with great influence on me is just getting longer and longer, and I will keep you updated. 🙂

Women in Data Science is an initiative that finds its origins at Stanford University’s Institute for Computational & Mathematical Engineering. With the idea to inspire, educate and support, they gather enthusiasts, academics and leaders with one main goal – to establish a baseline for all women willing to enter the field of Data Science and have the influence to make a change. The first conference was held back in 2015, at Stanford, and since then, it has grown to be a global conference with more than 150 regional events, including the one held in Subotica. I cannot express the gratitude and the pleasure I felt being there and meeting all these great women!

The conference was organized as a one-day event, located in InfostudHub. The introductory talk was given by Tatjana Kecojevic, who had a major role in the whole initiative and organization of the event. Tatjana has a strong theoretical and technical background in the field of Data Science, particularly in statistics and R. As a founder of R-Ladies Manchester and co-organizer of R-Ladies Belgrade and R-Ladies Montenegro, she has shown great initiative to empower and support all data science enthusiasts, especially women. She really is an expert in this field, and if you’d like to find out more, check out her GitHub page, there is a lot that can be learnt from it! And yes, the site of the conference was made in R! 😉

Olivera was the first speaker introduced to participants. Olivera Grljević is an Assistant Professor at the Faculty of Economics in Subotica. With many years of experience in the field of research and Data Science, especially text analysis, she was talking about the current situation at universities, focusing on the University of Novi Sad, regarding the initiatives and ongoing projects.

Katica Ristić is someone I knew from LinkedIn, and as soon as she entered the room, I knew who she was. Her enthusiasm is hard to ignore. Katica talked about the mysterious world of Data Science, and her incredible path from being a mathematician and a teacher, to being an applied mathematician and a Data Scientist. She tried to explain her learning process and presented a course she is currently enrolled in.

Bojana Milašinović is a young woman with great initiative to empower other women to enter the world of IT. I had the opportunity to meet her last year, at the ENTER conference. She talked about her transition from culture and art to IT. This is especially important in the field of Data Science, where people can arrive from many different areas of science and research. Many women are afraid of making that transition and face a strong barrier when switching, so her example is a great inspiration for everyone willing to make a change.

Katarina Kosmina is someone recognized and affirmed by the community as an inspiring activist, researcher and coordinator involved in multiple projects regarding digital transformation, open data and education. If you have ever visited a meetup, hackathon or other form of community event in the field of Data Science, the chances are that you’ve met her. 🙂

Having a PhD in mathematics and several years of academic experience at the University of Novi Sad, as well as in industry, Milena Kresoja is dealing with serious advanced analytics projects in financial analysis and services. She is currently working on a platform for quantitative financial analysis, where her main focus is outlier analysis in financial time series.

Professor Nadica Miljković is someone I haven’t had a chance to meet in person yet. But I can’t wait to do that! 🙂 Her main focus of research is related to biomedical engineering, and she has multiple successful and interesting projects in biosignal processing. She is also one of the organizers of the R-Ladies meetup in Belgrade. Nadica was also the moderator of the panel that I had a chance to participate in.

All these great and successful women that I previously mentioned left a strong impression on me, and I believe they will leave a huge mark on my career as well. They are all extraordinarily modest, kind and supportive. They are experts and experienced leaders in their niches. And I was really surprised and delighted to see that there is something serious going on in the field of Data Science outside Belgrade! 🙂 In the discussion we had, we concluded that women need empowerment. From other women and from the community itself. Not because of the existence of active discrimination. But because not everybody is brave enough to stand out. And this is a delicate problem with complex factors. There are lots of things that need to be changed, from upbringing during childhood, through education at schools and universities, to activities in the community. And that is what we really need, especially in Data Science. We desperately need variety and diversity. And we need to encourage it. Not only regarding gender, but also age, ethnicity, discipline, and lots more.

The conference has a major message, including the answer to the question – what is the first step? The first step is to educate yourself and to find the path you want to follow. And to be persistent. And the task for the community should be to provide all the necessities you need in order to do that. As a community, we need to establish a baseline for sustainable development and growth. Sharing is caring, and sharing the knowledge makes it grow bigger and bigger. Empowering diversity ensures multiple perspectives and awakens critical thinking, which leads to rational reasoning and objective conclusions. Let’s be responsible and compassionate. Great conference, great people, can’t wait to go back there! 🙂

Down below, you can find me enjoying the conference. 🙂 Cheers!

Friday talks: The Holy Grail of Machine Learning

Pedro Domingos is a professor of computer science and engineering at the University of Washington. He is a winner of the SIGKDD Innovation Award, which FYI is the highest honor in Data Science. They say that approximately seven years is the time period needed to become an expert in the field of Data Science. This man held one of his first lectures about Artificial Intelligence back in 2001. He has decades of research, hard work and engagements behind him. He is the author of the book The Master Algorithm. Since I like this book a lot, I would like to share my main impressions in this blog post. I know that we all worship Andrew, but as for me, I have found yet another god here on Earth. Pedro Domingos is a name that I failed to mention in my previous post. And there’s a whole bunch of important names that totally slipped my mind. But fortunately, I have a chance to make it right. 🙂 If you liked the story about the hedgehog and the fox, stay tuned, because it’s going to be even more interesting! 😉

The five tribes of machine learning

In his book, Pedro talks about something he calls the machine learning tribes. What is it all about? I bet you can anticipate it. As in every other field, there are different streams that you can follow. And that depends on your beliefs and preferences. Now this is the point where this whole discussion about generalists and specialists escalates. So, it’s not only about the simple partition into generalists and specialists, but also about the paradigms you pursue, which kinda makes things even more complicated. 😀 Or we may call it – interesting. It’s like I always knew that there are different mindsets and approaches, but couldn’t quite draw the exact line between them. Pedro lists the following basic directions: connectionists, evolutionaries, bayesians, analogizers and symbolists. Before digging deeper into these terms, let’s have a general look. Each approach finds its origin in a different field of science and research. Each one of them has its own master algorithm, the general purpose learner that can (or at least – it is believed it can) be used to learn everything. If you give it enough data. And computing power. And some more data. And although you may gravitate to one of these approaches, you cannot tell for sure which one is the best or the most powerful. Each one of them has its fortes in specific problem domains. On the other side, each one of them has its own shortcomings. Let’s take a walk through these “schools of thought” in machine learning.

Connectionists

This is maybe the most popular approach nowadays, since it includes neural networks and thus, deep learning. Connectionists believe that problems need to be solved the way humans solve them. So they find their baseline in neuroscience, trying to emulate the human brain. To be precise, connectionists go nuts when someone says that neural networks are emulating the human brain. Because they aren’t. The human brain is much more complex, and cannot be easily emulated, since there are many things yet to be discovered. Neural networks only use the human brain as an inspiration, rather than as a model which is blindly being followed. They consist of computing units – neurons, that take inputs, calculate their weighted sum, pass the result through some activation function and feed that output forward to the neurons in the following layer. The general idea is simple – take this input, let the neurons perform magic, and return some output. But how do the neural networks learn, how do they generate the output they should? By using backpropagation, which is regarded as their biggest advantage. The error is propagated back from the output to the previous layers of the network to tweak the weights, in order to make the error as small as possible. And they really are powerful and widely used. Image recognition, cancer research, natural language processing, you name it. But there still are some shortcomings, like the amount of data required. Or the lack of interpretability. Long way to go, but connectionists believe they might become the all-mighty algorithm one day.
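
The weighted sum, the activation function and the backward flow of the error can be sketched in a few lines of plain Python. This is only a toy under my own assumptions (one hidden layer of three neurons, sigmoid activations, squared error, the XOR data, a learning rate of 0.5), not how any real deep learning framework is implemented.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Weighted sum plus activation in each layer; returns (hidden activations, output)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient step on a single example; returns the error before the update."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output neuron: derivative of 0.5 * (output - target)^2
    # with respect to the output neuron's weighted sum.
    delta_out = (output - target) * output * (1.0 - output)
    # Propagate the error signal back to each hidden neuron through its outgoing weight.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    for j in range(len(hidden)):
        w_out[j] -= lr * delta_out * hidden[j]
        for i in range(len(x)):
            w_hidden[j][i] -= lr * delta_hidden[j] * x[i]
    return 0.5 * (output - target) ** 2


random.seed(0)
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
w_out = [random.uniform(-1, 1) for _ in range(3)]

# XOR, with a constant 1.0 appended to every input as a bias term.
data = [([0.0, 0.0, 1.0], 0.0), ([0.0, 1.0, 1.0], 1.0),
        ([1.0, 0.0, 1.0], 1.0), ([1.0, 1.0, 1.0], 0.0)]

for epoch in range(2001):
    loss = sum(backprop_step(x, t, w_hidden, w_out) for x, t in data)
    if epoch % 500 == 0:
        print(f"epoch {epoch:4d}  loss {loss:.4f}")
```

Notice that the structure (three hidden neurons, one output) is fixed up front, and learning only tweaks the weights.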

Evolutionaries

The evolutionaries go one step further than connectionists. They claim that the evolutionary process is far more powerful than the human reasoning process, since, in the end, evolution has produced reasoning as it is nowadays. They take their roots in evolutionary biology. Some of the most famous evolutionaries are John Holland, who developed genetic algorithms, John Koza, who developed genetic programming, and Hod Lipson, who continued this development through evolutionary learning. How does the genetic algorithm work? At the beginning of the process, we have a population of individuals. In the centre of attention is a genome, which describes each individual (in computation, represented in bits). Each individual is evaluated on the specific task it should be solving, so that the best fit individuals have bigger chances to survive and be the parents of the next generation. In this process, the genomes of two individuals get combined to create a child genome, which contains one part of the genome from each parent. As in evolution, random mutations of genomes can happen in evolutionary learning too, so we practically get a new population in each generation. This process is repeated iteratively until we get a generation with best fit individuals able to solve the problem optimally. While the connectionists’ approach is only adjusting weights in order to fit a fixed structure, the evolutionaries’ approach is able to learn the structure, to come up with the structure itself.
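
The loop described above (evaluate, select parents, combine genomes, mutate, repeat) can be sketched as follows. The task is a toy one I picked for illustration (maximize the number of 1-bits in the genome), and tournament selection, the mutation rate and keeping the single best individual in each generation are my own assumptions, not details from the book.

```python
import random


def fitness(genome):
    return sum(genome)  # toy task: the more 1-bits, the fitter the individual


def crossover(mother, father):
    # The child takes the first part of its genome from one parent, the rest from the other.
    cut = random.randint(1, len(mother) - 1)
    return mother[:cut] + father[cut:]


def mutate(genome, rate=0.02):
    # Each bit flips with a small probability.
    return [1 - bit if random.random() < rate else bit for bit in genome]


def select(population):
    # Tournament selection: the fitter of two random individuals becomes a parent.
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def evolve(population, generations=50):
    for _ in range(generations):
        best = max(population, key=fitness)  # keep the best individual unchanged
        children = [mutate(crossover(select(population), select(population)))
                    for _ in range(len(population) - 1)]
        population = [best] + children
    return population


random.seed(42)
population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
print("best fitness before:", max(map(fitness, population)))
population = evolve(population)
print("best fitness after: ", max(map(fitness, population)))
```

Here the genome is just a fixed-length bit string. In genetic programming the genome would encode a whole program or formula, which is how this family of methods can discover structure and not only parameters.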

Bayesians

Now, the Bayesians find their origins in statistics. Since everything is uncertain, they quantify the uncertainty by calculating probabilities. They bow to Bayes’ theorem. But how is their learning process conducted? At first, they define some hypotheses. After that, they assign a prior probability to each hypothesis, meaning how much they believe the hypothesis is true before seeing any data. Pedro notices that this is the most controversial part of their learning process. How can you assume something given no data? But as the evidence comes in, they update the probability of each hypothesis. They also measure the likelihood, which tells how probable the evidence is, given that the hypothesis is true. After that, they can calculate the posterior probability, which tells how probable the hypothesis is given the observed evidence. Thus, a hypothesis consistent with the data will have increasing probability, and vice versa. The biggest forte of this approach is its ability to measure uncertainty. And that frequently is the problem. Generally speaking, the new knowledge we generate is uncertain at first, and it is good to quantify that uncertainty. Maybe you’re not quite aware of it, but self-driving cars have Bayesian networks implemented inside, and they use them for the learning process. Some of the most famous bayesians are Judea Pearl, who developed powerful Bayesian networks, David Heckerman and Michael Jordan. It is said that bayesians are the most fanatic of all the tribes, so I wouldn’t mess with them, to be honest. 😉
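
The prior, likelihood and posterior fit into a very small piece of code. In this sketch the two hypotheses about a coin, the 50/50 prior and the sequence of flips are made up for illustration only.

```python
def bayes_update(priors, likelihoods):
    """posterior(h) = prior(h) * P(evidence | h) / P(evidence)"""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())  # P(evidence), summed over all hypotheses
    return {h: p / evidence for h, p in unnormalized.items()}


# Two hypotheses about a coin and the probability of heads under each of them.
p_heads = {"fair": 0.5, "biased": 0.8}

# Prior belief before seeing any flips.
beliefs = {"fair": 0.5, "biased": 0.5}

for flip in "HHTHH":
    likelihoods = {h: (p if flip == "H" else 1.0 - p) for h, p in p_heads.items()}
    beliefs = bayes_update(beliefs, likelihoods)  # yesterday's posterior is today's prior
    print(flip, {h: round(p, 3) for h, p in beliefs.items()})
```

Every head pushes the belief towards the biased coin and every tail pulls it back, which is exactly the “hypothesis consistent with the data gains probability” behaviour described above.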

Analogizers

The basic idea brought by the analogizers is reasoning by analogy. To transfer the solution of a known situation to a new situation we face. As you may infer, they find their origins in several areas of science, but mostly in psychology. Peter Hart is one of the most famous analogizers, since he dealt a lot with the nearest neighbour algorithm. Vladimir Vapnik is the inventor of support vector machines, also known as kernel machines. As another analogizer, Douglas Hofstadter has been working on more sophisticated topics, which he presented in his books. Pedro says that in one of his books, Douglas even spent five hundred pages just arguing that all of intelligence is pure analogy. Now, maybe we need to reconsider who the fanatic ones are here. 😀 So, basically, what they claim is that we can expand our knowledge by investigating new phenomena through drawing analogies with other known phenomena. The most famous application of analogy based learning is recommender systems. And it really does make sense! If you want to determine what to suggest next to a customer, then check what similar customers liked, and by analogy place similar offers to the given customer. The best competence of the analogizers is the ability to generalize from just a few examples. We often encounter unknown problems, and their approach of learning by similarity is a good problem solver in those cases. Simple as that, ain’t it?
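
The “check what similar customers liked” idea is essentially a nearest neighbour search. Below is a minimal sketch of it; the users, the ratings, the choice of cosine similarity and the function names are all hypothetical and serve only to show the mechanics.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "ana":   {"Aristoi": 5, "Angel Station": 4, "Dune": 5},
    "boris": {"Aristoi": 5, "Angel Station": 5},
    "ceca":  {"Cookbook": 5, "Dune": 1},
}


def cosine(u, v):
    """Cosine similarity of two sparse rating vectors (missing ratings count as 0)."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def recommend(user, ratings, k=1):
    """Suggest items that the k most similar users rated and `user` has not seen."""
    others = [name for name in ratings if name != user]
    neighbours = sorted(others,
                        key=lambda name: cosine(ratings[user], ratings[name]),
                        reverse=True)[:k]
    scores = {}
    for name in neighbours:
        sim = cosine(ratings[user], ratings[name])
        for item, rating in ratings[name].items():
            if item not in ratings[user]:
                # Weight the neighbour's rating by how similar the neighbour is.
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("boris", ratings))
```

Boris shares his taste with Ana, so he gets the one book Ana rated that he has not read yet. No model is trained here at all; the stored examples and the similarity measure do all the work.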

Symbolists

The symbolists find their origin in logic and philosophy. Their main purpose is filling in the gaps in the existing knowledge. Learning is the induction of knowledge. Induction is basically inverted deduction. General rules are made of specific facts, so the process of induction includes going from specific facts to general rules. But not only that. Based on some known rules, they induce new rules, combine them with the known facts, and raise questions that were never asked before, which leads to new rules and answers that enrich the knowledge more and more. Practically, they start the process with some basic knowledge, then they formulate hypotheses, design experiments, carry them out and, given the results, refine the hypotheses or generate new ones. This is the closest thing to the way in which scientists generally approach research. Thus, this approach is mostly used in the creation of robot scientists. One of Pedro’s most famous examples of inverse deduction is the molecular biology robot, Eve. Eve has found a new malaria drug by using inverse deduction. Some of the most prominent symbolists that Pedro lists are Steve Muggleton, Tom Mitchell and Ross Quinlan. The biggest advantage of this approach is that they are able to compose knowledge in many different ways, by using logic and inverse deduction.

To sum it up.

Approach       | Problem               | Solution
Connectionists | Credit assignment     | Backpropagation
Evolutionaries | Structure discovery   | Genetic programming
Bayesians      | Uncertainty           | Probabilistic inference
Analogizers    | Similarity            | Kernel machines
Symbolists     | Knowledge composition | Inverse deduction

One algorithm to rule them all…

What Pedro states is that eventually – it’s not about the partition itself. It’s about finding the unique solution. The universal learner. The all-mighty model. The Master algorithm. So, each of these five schools has its own advantages. Can we take the biggest forte of each one of them, combine them, and get the master algorithm able to solve each machine learning problem we give it? Pedro states this is the goal. Is it too complicated? Doesn’t necessarily have to be. He divides all these approaches into three main modules: representation, evaluation and optimization. The representation tells us how the learner represents what it is learning. And in most cases, this will be reflected in common sense logic (but it could be in differential equations, polynomial functions, or whatever). In the book, the unification of this module comes by combining the symbolists’ and bayesians’ approaches. Since symbolists use logic, while bayesians use graphical models – their combination can represent any type of problem one can encounter. The evaluation is used for measuring the performance of the model (in terms of pattern recognition, data fitness, generalization, etc.). The master algorithm should be able to take any of the evaluation functions available in these five approaches, and let the user decide which one of them will be used. The optimization is the task of finding the best-fit model for a given problem. This includes discovering formulas that will mathematically describe a given problem, by using the evolutionaries’ genetic programming approach, and optimizing the weights in these formulas, which can be solved with the backpropagation algorithm used by the connectionists. Pedro believes that the master algorithm will be able to solve any problem given. The most practical examples include home robots, cancer cure(s), 360-view recommender systems, etc. The list goes on and on.

I would like to close this post with his inspiring words:

Scientists make theories and engineers make devices. Computer scientists make both theories and devices.

Ain’t it? The book is pretty concisely written, and can be read in one breath! Please accept this book as my warmest recommendation for a good read, and just enjoy it. Hopefully it will be as inspiring to you as it was to me. Share your impressions, can’t wait to hear them! :3

Friday talks: A Data Science Project

This post is not going to be about another Data Science course you should enroll in. It’s not going to be about various skills you should build in order to develop a Data Science project, either. Considering the title of this post – A Data Science Project – I tried to create a pun. Your journey to the destination called “I am a Data Scientist” is a project you should be working on, with phases, iterations, and disagreement between the user requirements and generated outcomes. I would like to talk about my Data Science path, and what it is like to be a Data Scientist from my perspective. I can assure you that there are tons of blog posts on the web that are sharing the same topic, enriched with more information and experience than my own, but the thing is – I want to talk about heading this way and share with you some unconventional directives that made my journey a lot easier, and hopefully, would do the same for you.

So, regarding the beginnings, there are some baby steps you should make, in order to build basic skills needed for the purpose of analysis and extracting insights. And you really do it well. The beginnings are no longer a problem. Most of you start with the Machine learning course held by the incarnation of a deity in the world of machine learning, Andrew Ng. Or with DataCamp. Or at Kaggle. And that’s the right way to do it. But there are some additional activities you can practice, that will make it a lot easier for you to master this field and/or to enrich your experience and spread your collection of skills.

1. Research & Blogs

Being a Data Scientist requires lots of research. In order to extract the most from the data, you should be aware of the limits. And the limits are constantly changing. How do you know where the limit is? By doing some serious research! Follow what academia is doing, but also how the implementation is going in the industry. Besides academic research and scientific papers on some particular subject, I read lots of blogs on a daily basis. Some of the blogs that I personally like are Analytics Vidhya, Towards Data Science, Machine learning mastery, Brandon Rohrer’s blog, and Colah’s blog. Or, you can install Flipboard, set your topics of interest, and follow up.

2. Meetups & Conferences

Communities and gatherings are precious things in this field. Lots of enthusiasts and experienced people can be found at such events, sharing their knowledge and findings. At Things Solver, we really believe in the “sharing is caring” idiom, and with that in mind, we try to share our knowledge and to let it grow even more through these events. There are many meetups in Serbia on Data Science, AI, and related topics, so you can start by exploring Meetup.com and the areas you’re interested in. The most popular Data Science community is Data Science Serbia, which organizes meetups and encourages bonding and networking among Data Science enthusiasts. As for the conferences in Serbia, the most popular one for certain is Data Science Conference, growing bigger each year.

3. Social networks & Influencers

Social networks are a good way to follow the activities and events, even though you’re not able to be there physically. What I really use on a daily level is LinkedIn. There are some inspiring people that I follow and learn from, like Favio Vázquez, Brandon Rohrer, Jason Brownlee, Andriy Burkov and many, many more.

People are often underestimating these things, but they really are a crucial part of a continuous Data Science path. And that is one of the biggest problems one encounters at the beginning. Like every other field, it requires dedication, research and lots of learning. And, since it is continuously growing, one should simultaneously grow alongside, in order to be at the top, comfortable with the cutting edge technologies. And, to be honest, that is not easy.

The wanderer’s puzzle

The first thing I want to discuss is something I call “the wanderer’s puzzle”. And I want to open this section with Tolkien’s words ”Not all those who wander are lost…”. So, entering this field (or any other field), you’re probably feeling lost. But what’s the right thing to do in Data Science? It depends. There is no such thing as a recipe with perfectly determined doses of ingredients. The first thing is to wander. To find yourself. And I have a really interesting story to share with you, called The Hedgehog and the Fox. My dear colleague Anđela shared this story with me, and it really helped me find myself. I drew an analogy between this story and Data Science. You have to determine whether you’re a hedgehog or a fox. It depends on your interests. You can either be a hedgehog, focused on mastering one thing, or a fox, roaming through various domains at the same time. Regarding Data Science, I know many colleagues that are total hedgehogs (they are experts in computer vision, for example, but they have never heard of an Isolation Forest or a survival curve). And similarly, I have lots of colleagues who are foxes; they have played with CNNs, time series analysis and store optimization in various domains like marketing, finance etc., but they always say they haven’t yet dug deeper into any of these areas.

The imposter syndrome

Another thing that I would like to talk about is confidence. Reading lots of blogs and listening to many technical courses and presentations, I’ve really had a hard time believing in myself and building confidence and self-awareness. I never thought about the real problem I was facing – called the Imposter syndrome. So, the imposter syndrome… This is a situation where you’re doubting yourself, your competences and knowledge, afraid of being exposed or flagged as a “fraud”. This is a frequent problem, and lots of successful people are facing it. You know that there will always be someone with more experience, more knowledge, better competences. That’s not the problem. The problem is that you think you’re not good enough. That your achievements are not yours, but the merit of someone else, or an accidental series of lucky circumstances. And that it’s only a matter of time before someone breaks you and ruins your career and everything you’ve accomplished. I was lucky to have a conversation with a more experienced Data Scientist, who pointed me to this problem. And I have a perfect read on this topic here. So, stop doubting yourself and keep rocking the Data Science!

Development vs. production

When looking at the practice and real-world application, there also are some key drivers you should be aware of, in order to stay on track and save your stamina. And that’s not something that you can easily learn or hear about just around the corner. Dealing with some real-world Data Science projects, I have learned one crucial thing. You should never (like, EVER!) look at Data Science project development and production as two separate things. They are done in separate phases, they can be done by separate teams, they can eventually be separated by the environments and the conditions they are running in. But they should always be regarded as a whole, a unity, a completeness. Now, I know that you’re asking yourself – why would I possibly divide those things? Yet again, I am sharing my experience, and yes, I made this mistake. And learned from it.

Each Data Science project starts with a problem that should be solved. The solution of the problem should lead to business improvements, reflected in revenue increase, cost reduction, or whatever the desired metric is. There are several phases in the development process, as well as in the deployment, and this flow is usually divided between several teams. Due to the numerous phases and iterations in the process, lots of things can happen, potentially leading to complications and project failures. It is unnecessary to emphasize that everyone involved should be completely dedicated and aware, for this process to be perfect. So, is there anything that you can do (or avoid doing), as a Data Scientist, in order to make this process as fluent as possible? Yes, for sure! In many cases, Data Scientists are described as lazy and messy. Why is that? We develop our models and test them in environments that are not even IDEs, but some kind of a browser tab! We love the interactivity and line by line execution! And that really comes in handy during the development phase, when playing with the data and different models. We have a pretty narrow focus on finding the right model, putting everything else (like data extraction, code modularity, results delivery, etc.) aside. The problem appears when you’ve chosen a satisfying model. You cannot just throw it to the teammate assigned to the deployment, like it is a hot potato! And that’s the biggest issue in every project. In most cases, especially when you are a rookie, models are not production-ready. And that can lead to lots of headaches. Data Scientists often neglect the steps that come after the model training phase. And that is pretty irresponsible and not aligned with the team spirit you should have! You have to think about production and model deployment. You have to communicate with the ones responsible for the model deployment. And, if it’s you that is also deploying the model into production, you should be responsible to yourself, too.

My most sincere recommendation is to always think about the whole process. The things you should always take into consideration are model scalability, generalization, adaptation, optimization, and additional tweaking. Write code that is readable and easily upgraded. Parameterize everything that is prone to changes. Develop models that can easily be enriched with more data. Create pipelines. And, even if you’re a researcher or a “lazy” Data Scientist in a team consisting of both Data engineers and Machine learning engineers, make sure that you understand the whole process, at least. You’re not an independent entity in the project. The process will be much faster and more efficient if you take these into account from the beginning. Not to mention the project flow and success rate. Finally, you are a Data Scientist. It’s not only about 95% accuracy. It is about the impact of the whole process. You have to understand why you’re doing it. But also how that is changing the environment you’re in. And that is much more satisfying than the 95% accuracy, to be real. If a model with 68% accuracy is driving the changes and creating business value – I’m totally up for that!

There is one last thing I want to share with you. How do you continuously grow? The following are three very simple, but powerful steps I stumbled upon while browsing the net (check out the whole post here, it really is valuable).

Identify your weaknesses

Define a plan that should convert your weakness to your forte

Execute the plan

 

Simple as that, ain’t it? 🙂

 

How the Big Data Won the Hearts in Telecommunications

The fact that there is a deep connection between telecommunications and Big Data is very clear – the main task of telecommunications is exchanging data. Since the amount of data has enormously increased in the modern era, the experts in telecommunication companies needed some help from specialised experts.
The need for experts “that appear out of the blue and solve the matter” is not unfamiliar. A good physician would do a thorough check of the patient, collect data through different tests, set up a preliminary diagnosis, prepare everything for the surgery, and draw the path to full recovery. But for the surgery itself, he would call an expert in the area to assist in the procedure. That is the only way to ensure that all the collected test results, findings, x-rays – i.e. data – are used in the most efficient way, so that the patient gets the greatest benefit from them. “The specialist” would see in the data even what the experts for the other parts of the process cannot identify. Telecommunications giant Vip Mobile called in Things Solver as “the specialists” for Big Data and analytics – the task was to get the most out of the data they collect.

Few Months to See First Results

Vip Mobile Software Architect Goran Pavlovic says that the specialists’ help was most needed in the analysis of key business operations segments. The task was to perform a network capacity analysis, analyse interactions among the employees during the incident and complaint solving process, and analyse user interactions in the Web Shop.

Vip Mobile Engineer Djordje Begenisic was involved in the network capacity analysis. He remembers that the task was to make a tool that should help predict and keep track of the network performance and user experience. “It was all focused on the timely detection of capacity issues, and also on the identification of the users who are the least satisfied with the services. With that knowledge and a proactive approach, we could notably decrease the level of user dissatisfaction”, Begenisic claims.
It did not take long before the first results arrived. “In just a couple of months, we managed to notably increase the precision level in prediction. That also removed all the doubts about the power of Data Scientists”, Begenisic concludes.
The Vip engineer also adds that spending time with data experts ready to dig deep was the added value of the whole process. At the same time, it was necessary in order to turn a joint effort into a success story.

The Road from Excitement to the Result

“Our cooperation begins with the opening excitement while you are presenting the problem and the options for solving it”, Begenisic describes the joint efforts. “The Data Scientist listens carefully and then draws a conclusion, which you probably do not even understand.”
“Each of the sides has the key knowledge the other needs in order to create a successful product – the telecommunication experts have the knowledge in that domain, while Data Scientists bring knowledge of data processing and machine learning”, Software Architect Pavlovic says. Djordje Begenisic reveals an interesting aspect of the story: “It is a great advantage even if the Data Scientist does not fully have the domain knowledge, because there is a chance to analyse the problem fully, without the disturbance that incomplete domain knowledge can bring into the process.”

Brainstorming remains the key part of the multidisciplinary process. A joint effort to look for ideas is the key to success. The result is also part of that joint effort – data science “practises” and perfects its own methods on telecommunications big data, while the telco experts have a chance to develop new skills in modern technologies through working with Data Scientists. It is a true win-win story.

Time series Anomaly Detection using a Variational Autoencoder (VAE)

Why time series anomaly detection?

 

Let’s say you are tracking a large number of business-related or technical KPIs (that may have seasonality and noise). It is in your interest to automatically isolate a time window for a single KPI whose behavior deviates from normal behavior (contextual anomaly – for the definition refer to this post). When you have the problematic time window at hand you can further explore the values of that KPI. You can then link the anomaly to an event which caused the unexpected behavior. Most importantly, you can then act on the information.

To do the automatic time window isolation we need a time series anomaly detection machine learning model. The goal of this post is to introduce a probabilistic neural network (VAE) as a time series machine learning model and explore its use in the area of anomaly detection. Although this post tries to keep the math to a minimum, it does require some knowledge of neural networks and probability.

Background

 

As Valentina mentioned in her post there are three different approaches to anomaly detection using machine learning based on the availability of labels:

  1. unsupervised anomaly detection
  2. semi-supervised anomaly detection
  3. supervised anomaly detection

Someone who has knowledge of the domain needs to assign labels manually. Therefore, acquiring precise and extensive labels is a time-consuming and expensive process. I’ve deliberately put unsupervised as the first approach, since it doesn’t require labels. It does, however, require that normal instances outnumber the abnormal ones. Not only do we require an unsupervised model, we also require it to be good at modeling non-linearities.

What model? Enter neural networks…

 

Historically, different kinds of neural networks have had success with modeling complex non-linear data (e.g. image, sound and text data). However, universal function approximators that they are, they have inevitably found their way into modeling tabular data. One interesting type of tabular data modeling is time-series modeling.

A model that has made the transition from complex data to tabular data is the Autoencoder (AE). An Autoencoder consists of two parts, an encoder and a decoder. It tries to learn a smaller representation of its input (encoder) and then reconstruct its input from that smaller representation (decoder). The anomaly score is designed to correspond to the reconstruction error.

The Autoencoder has a probabilistic sibling, the Variational Autoencoder (VAE), a Bayesian neural network. Rather than reconstructing the original input, it tries to reconstruct the parameters of the (chosen) distribution of the output. The anomaly score is designed to correspond to an anomaly probability. Choosing a distribution is a problem-dependent task and can also be a research path. Now we delve into slightly more technical details.

 

Both AE and VAE use a sliding window of KPI values as an input. Model performance is mainly determined by the size of the sliding window.
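
To make the sliding window idea concrete, here is a minimal sketch in plain Python. The helper name sliding_windows, the step parameter and the KPI values are made up for illustration and are not part of the original model code. Each window produced this way would be one input row for the model.

```python
def sliding_windows(values, window_size, step=1):
    """Return every run of `window_size` consecutive values, `step` apart."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(values) - window_size
    # If the series is shorter than one window, the range is empty.
    return [list(values[i:i + window_size])
            for i in range(0, last_start + 1, step)]


# Hypothetical KPI readings with a single spike.
kpi = [1.0, 1.2, 0.9, 9.0, 1.1, 1.0]
windows = sliding_windows(kpi, window_size=3)
for w in windows:
    print(w)
```

A larger window gives the model more context per row but yields fewer rows and smears a single spike across more of them, which is one reason the window size matters so much.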

Diggin’ deeper into Variational Autoencoders…

 

 

The smaller representation in the VAE context is called a latent variable and it has a prior distribution (chosen to be the Normal distribution). The encoder is its posterior distribution and the decoder is its likelihood distribution. Both of them are Normal distributions in our problem. A forward pass would be:

  1. Encode an instance into a mean value and standard deviation of the latent variable
  2. Sample from the latent variable’s distribution
  3. Decode the sample into a mean value and standard deviation of the output variable
  4. Sample from the output variable’s distribution

The Variational Autoencoder is a probabilistic neural network (also named a Bayesian neural network). It is also a type of graphical model. An in-depth description of graphical models can be found in Chapter 8 of Christopher Bishop’s Pattern Recognition and Machine Learning.

A TensorFlow definition of the model:

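# Written against the TensorFlow 1.x graph API (tf.placeholder, tf.contrib).
from timeit import default_timer as timer  # assumed source of timer()

import numpy as np
import tensorflow as tf

# Assumed alias: the listing uses tfd without defining it.
tfd = tf.contrib.distributions


class VAE(object):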
    def __init__(self, kpi, z_dim=None, n_dim=None, hidden_layer_sz=None):
        """
        Args:
          z_dim : dimension of latent space.
          n_dim : dimension of input data.
        """
        if not z_dim or not n_dim:
            raise ValueError("You should set z_dim (latent space dimension) "
                             "and n_dim (input dimension).")

        tf.reset_default_graph()
        
        def make_prior(code_size):
            loc = tf.zeros(code_size)
            scale = tf.ones(code_size)
            return tfd.MultivariateNormalDiag(loc, scale)

        self.z_dim = z_dim
        self.n_dim = n_dim
        self.kpi = kpi
        self.dense_size = hidden_layer_sz
        
        self.input = tf.placeholder(dtype=tf.float32,shape=[None, n_dim], name='KPI_data')
        self.batch_size = tf.placeholder(tf.int64, name="init_batch_size")

        # tf.data api
        dataset = tf.data.Dataset.from_tensor_slices(self.input).repeat() \
            .batch(self.batch_size)
        self.ite = dataset.make_initializable_iterator()
        self.x = self.ite.get_next()
        
        # Define the model. Encoder: maps a window to the posterior over the latent code.
        self.prior = make_prior(code_size=self.z_dim)
        x = tf.contrib.layers.flatten(self.x)
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        loc = tf.layers.dense(x, self.z_dim)
        scale = tf.layers.dense(x, self.z_dim , tf.nn.softplus)
        self.posterior = tfd.MultivariateNormalDiag(loc, scale)
        self.code = self.posterior.sample()

        # Decoder (likelihood), then the loss: the negative ELBO.
        x = self.code
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        x = tf.layers.dense(x, self.dense_size, tf.nn.relu)
        loc = tf.layers.dense(x, self.n_dim)
        scale = tf.layers.dense(x, self.n_dim , tf.nn.softplus)
        self.decoder = tfd.MultivariateNormalDiag(loc, scale)
        self.likelihood = self.decoder.log_prob(self.x)
        self.divergence = tf.contrib.distributions.kl_divergence(self.posterior, self.prior)
        self.elbo = tf.reduce_mean(self.likelihood - self.divergence)
        self._cost = -self.elbo
        
        self.saver = tf.train.Saver()
        self.sess = tf.Session()

    def fit(self, Xs, learning_rate=0.001, num_epochs=10, batch_sz=200, verbose=True):
        # Minimising the cost maximises the ELBO.
        self.optimize = tf.train.AdamOptimizer(learning_rate).minimize(self._cost)

        # Xs has shape [n_windows, n_dim], so len(Xs) is the number of windows.
        batches_per_epoch = int(np.ceil(len(Xs) / batch_sz))
        print("\n")
        print("Training anomaly detector/dimensionality reduction VAE for KPI", self.kpi)
        print("\n")
        print("There are", batches_per_epoch, "batches per epoch")
        start = timer()

        self.sess.run(tf.global_variables_initializer())

        for epoch in range(num_epochs):
            train_error = 0

            # Re-initialise the iterator with the training data for this epoch.
            self.sess.run(
                self.ite.initializer,
                feed_dict={
                    self.input: Xs,
                    self.batch_size: batch_sz})

            for step in range(batches_per_epoch):
                _, loss = self.sess.run([self.optimize, self._cost])
                train_error += loss
            mean_loss = train_error / batches_per_epoch

            if verbose:
                print("Epoch {:^6} Loss {:0.5f}".format(epoch + 1, mean_loss))

            # NaN never compares equal to itself, so test with np.isnan.
            if np.isnan(train_error):
                return False
        end = timer()
        print("\n")
        print("Training time {:0.2f} minutes".format((end - start) / 60))
        return True
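
The cost in the listing above is the negative of the likelihood term minus the divergence term (the ELBO). For readers who want to see what those two terms compute, here is a small sketch in plain Python for diagonal Normal distributions and a standard Normal prior. The function names and the numbers are illustrative assumptions, not part of the TensorFlow model.

```python
import math


def diag_normal_log_prob(x, loc, scale):
    """Log-density of x under a Normal with diagonal covariance."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, loc, scale)
    )


def kl_to_standard_normal(loc, scale):
    """KL( N(loc, diag(scale^2)) || N(0, I) ) in closed form."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(loc, scale))


def elbo(x, z_loc, z_scale, x_loc, x_scale):
    """Single-sample ELBO for one window: reconstruction term minus KL term."""
    return (diag_normal_log_prob(x, x_loc, x_scale)
            - kl_to_standard_normal(z_loc, z_scale))


# Made-up window and decoder outputs: one close to the window, one far off.
window = [1.0, 1.2]
good = elbo(window, [0.0, 0.0], [1.0, 1.0], [1.0, 1.2], [0.5, 0.5])
bad = elbo(window, [0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.5])
print(good > bad)
```

A window that the decoder explains poorly gets a low log-probability, which is the intuition behind using a probability-based anomaly score.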

Theory is great… What about the real world?

 

Using the model on one of the data sets from the Numenta Anomaly Benchmark (NAB):

 

In this case the model was able to achieve a true positive rate of 1.0 (TPR = 1.0) and a false positive rate of 0.07 (FPR = 0.07). Varying the anomaly probability threshold gives a ROC curve:

With the threshold read off the ROC curve plot, we get the following on the test set:

Just as the ROC curve suggested, the model was able to completely capture the abnormal behavior. Alas, as all neural network models are in need of hyperparameter tuning, this beast is no exception. However, the only hyperparameter that can greatly affect the performance is the size of the sliding window.
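
For readers who want to build a ROC curve of their own, here is a minimal sketch of how TPR and FPR can be computed for several thresholds. It assumes you have one anomaly score per window (higher means more anomalous) and a label per window. The function name, scores and labels below are made up for illustration and are not the NAB results.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, FPR, TPR) for each threshold.

    scores: anomaly score per window (higher means more anomalous).
    labels: 1 for a window known to be anomalous, 0 for a normal one.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one anomalous and one normal window")
    points = []
    for t in thresholds:
        # A window is flagged when its score reaches the threshold.
        flagged = [s >= t for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        points.append((t, fp / negatives, tp / positives))
    return points


# Made-up scores and labels, for illustration only.
scores = [0.05, 0.10, 0.20, 0.90, 0.95, 0.15]
labels = [0, 0, 0, 1, 1, 0]
points = roc_points(scores, labels, thresholds=[0.1, 0.5])
for threshold, fpr, tpr in points:
    print(threshold, fpr, tpr)
```

Plotting FPR against TPR over a fine grid of thresholds gives the curve, and the threshold is then picked at the point with an acceptable trade-off.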

I hope I was successful in introducing this fairly complex model in simple terms. I encourage you to try the model on other data sets available from here.

 

Keep on learning and things solving!

Recommender Systems and Banks: Precious Recommendation

The client’s path, from conceiving an idea to making it a project in the bank, used to be clear, but unpredictable. It involved a potentially noticed ad or the client’s own idea, a visit to the bank counter in person and a more or less successful deal with the bank. It was very time consuming, with very little control from the bank, and even less efficiency. Such a method was very hard to accept in an industry that is proud of the motto “time is money”.

Online communication between a bank and a client seems even more chaotic at first sight: random clicks on websites in search of information and answers to key questions, wandering around, looking for the needed service.
However, things do not have to look that way. Human behaviour in attempts to communicate with banks is usually anything but chaotic.

First Step – Creating a System

The best cure for chaos is introducing order. In the online communication between clients and banks, that means identifying the options to get the best possible outputs from the system, relying on the inputs.
“Recommender Systems are used to provide the best recommendation of the product that would interest the client most (system output), based on the user data (system input)”, Things Solver expert for development and implementation of Recommender Systems Strahinja Demic explains, using the company’s definition.

These systems can be classified according to the system input, the system output or the algorithms that operate in the background and create the recommendation. The first two classifications were created at Things Solver and are based on practical experience.
“In the classification based on system input, we can identify online inputs (user visits to the website, for example), offline inputs (user data kept in the bank’s database), or a mixture of both kinds of data. In the classification based on system output, the recommender can propose products the user has already experienced, products that he might be interested in but does not have yet, or a mixture of both kinds”, Demic describes the systems.

The New Approach for the Banks

Recommender Systems made by Things Solver make their way into the banking systems through online session data.
“An online session is defined as a client’s visit to the website and their entire activity on the website – for example, the path through the website, time spent on certain pages, choice of links. Based on that data, we try to describe what the client actually wants – the client’s visit to the webpage for loans or his attempt to make a calculation can send us a signal that there is some interest in taking a loan”, Demic explains.

The outcome of the process is a package of five products the client has looked at and five products he might be interested in but has had no direct experience with so far, where only the algorithm noticed the potential interest.
“For at least 10 percent of clients, we can notice the interest even before they show it at one of the banks, and that is not a small number. The percentage would grow if those clients were contacted in order to maintain their interest”, Demic explains. This was the moment when the managerial structures of banks became more interested in Recommender Systems.

But these results do not make only the management happy. The other employees are satisfied as well: results and efficiency improve without many changes in the procedures, which matters since people rarely welcome big changes. This recommendation then definitely deserves the adjective “precious” in its description.

Data Science Academy Finals: “New Knowledge Is Always Welcome”

Students of the first Data Science Academy spent three months on a journey through the world of data science, big data and analytics. Although most of them come from a world where Data Science is not the hottest thing, they bring home many impressions from the journey. “The knowledge is never a burden, new knowledge is always welcome”, Vip Mobile’s Jovana Barjaktarevic described the three-month journey.

At the end of the trip, they were welcomed again by the organisers – CEO of Things Solver Darko Marjanovic, Data Science Serbia president Branko Kovac, ICT Hub executive director Kosta Andric. All students were given diplomas and all the lecturers received special congratulations because their joint efforts resulted in spreading the basic knowledge about Data Science.

“In a short period of time, we made the students very interested in Data Science and they realised what it is like to be a Data Scientist. The biggest victory for me was the fact that they spread that knowledge and talked about it to their colleagues. For us as a company, it is very important to raise awareness and the level of knowledge in this area”, Darko Marjanovic says. “The first step was to gather people that get to work with Data Science and to analyse the problems together. During the past three months, we stayed longer and worked more, enjoying seeing the participants and mentors at work”, Kosta Andric says.

Everybody who came to the ICT Hub for the final evening had the chance to enjoy the work of the Data Science Academy participants, who presented their conclusions on the three cases covered during the past three months. Each group tackled a different company’s problem, and the groups were made up of participants from different companies, working in different positions.

Delhaize Serbia Case

The key task for this team was to find out how Delhaize Serbia could run the most efficient sales of its products, how long the sales should last and what the level of discounts should be in order to attract the most buyers.
As in every learning process, the team first formulated the key question: “Based on our analysis, we realised that the key optimal combination is the duration of sale and the size of discounts – this should produce a successful sale”. Still, the learning curve has its mistakes and troubles too: linear regression produced an error margin of 36% and was then replaced with the Random Forest Regressor, which decreased the error to 20 percent.
The team concluded that the best results are produced when only one specific chain of shops is separated from the sample and then combined with specific product categories, while the most efficient sales are found when the data is analysed on a daily basis.
Team Members: Srdjan Nesic, Bojan Baralic, Tamara Stanojevic, Zarko Milojevic.

Societe Generale Serbia Case

The team’s task was to discover how the bank could encourage more clients to use e-banking and m-banking. “Our task was to ask the right questions, get to know the whole subject, analyse data and test the model, making our conclusions at the end”.
They discovered how important it is to keep the data properly formatted and in order, placed in a single database. Using the Random Forest Classifier method, they found that the precision of their predictions was 73 percent.

The team also concluded that the age of the client is the most important criterion for recognising which client would turn from a passive to an active user of e-banking and m-banking.
Team Members: Marko Cebic, Tamara Rendulic, Dragana Rosic, Aleksandar Kovala.

Vip Mobile Case

The team faced 11.5 million rows of geolocated data on the use of the mobile network in a Serbian town. “We were supposed to identify the movement of users, their habits, needs, expenses, the use of base stations, the presence of foreign users and their habits, and finally – the potential for outsourcing the depersonalised data”.
First, the team learned that it is very hard to work with big data, so they kept it separated in two databases. They also learned that different unexpected and new problems may emerge during the analysis, such as anomalies in user behaviour or fraud prevention.
Their conclusion remains one of the main suggestions of the Academy: some of them will present the findings to their teams in their workplaces. So it seems that although the Data Science Academy has finished its “formal” first year of life, what was said there will certainly continue the journey.
Team Members: Luka Turudija, Milica Tomasevic, Jovana Barjaktarevic, Goran Kukobat.

Big Data and Banking: How Our Data Protects Us

Some of the main tasks for bankers are keeping, preserving, deriving more from less and taking the best out of what the clients deposit with them, with great confidence and trust. Almost the same definition could be applied to Data Scientists dealing with big data. They also deal with keeping, preserving, deriving more conclusions from a little information and getting the best out of the existing databases, which they handle with great care and confidentiality. Maybe it is because of this similarity in definitions, or maybe because the databases in banks are an ideal example on which detailed analysis can be performed, with conclusions that benefit both clients and banks, that the financial industry is among the top users of Data Scientists’ services, according to credible surveys.
One of the leading banks in the Serbian market, Banca Intesa has a client base of around 1.5 million. For Banca Intesa COO Aleksandar Stojadinovic, it is obvious that those databases should be treated as an important asset. When used properly, they could bring notable benefits for both clients and the companies.
The bank is part of a worldwide group headquartered in Italy, and Stojadinovic says that the use of big data in Serbia has never been closer to the world trends. “In the world of globalization, we have at our disposal all tools that are available to the most powerful companies in the world. The only thing that limits us is the lack of talents and specialists, as well as the size of the market and average living standards of the population.”
Banca Intesa decided to make a combination of their strong internal team and the experts of Things Solver in order to start a joint search for the solutions that can be applied right away.

Short Sprints Lead to Results

Stojadinovic describes the cooperation of data scientists and banking experts as an explosion of energy and ideas. “On one hand, we have Data Science experts. On the other, we have very experienced experts in banking, and they thoroughly know every issue we are trying to solve.” The last issue the team tackled was how to improve the user experience on social networks. Findings on clients, their desires and needs in the communication with the bank, the services they find important and the ways they found them are all already in the big databases. The answers just need to be properly derived.

Intesa’s COO describes the process of looking for the answers with sports vocabulary. “We start from the long-term vision and we try to follow this vision through short sprints that last 2 to 4 weeks, providing solutions for parts of the problem or improving the existing solutions.”

Data as Our Shield

If we decide to stay within the sports vocabulary, it is important for the banks to remain loyal to the rules of fair play and to stay on the track of respecting the rules, the laws and the safety of the data. Intesa claims to be fully aware that mutual trust is at the center of the profession. “There is an ethical, professional and legal limit – and that is good. On the other hand, it is clear that the limit is imposed only over the initiatives that can cause harm to the clients, while for everything else – the sky is the limit”, Stojadinovic concludes.

Because there is no limit, the winner is the one that uses this ethically, professionally and legally limited space in a more creative and innovative way. “We use the data that we have to prevent fraud, to analyze and improve security, for risk assessment and analyzing the loan potential”, says Stojadinovic.
This is how the data that we willingly share, at a time when everybody is worried about how it is treated, is becoming our shield from fraud. This is how the data scientists and bankers get back to their original mission of preserving trust.