INTUITION vs. DATA

Our company, Things Solver, helps clients make better decisions by analyzing relevant data. While analyzing that data, we often notice significant differences between what our clients would intuitively decide and what their data tells them to do.

One striking example of a gap between our perception and reality is the state of the world around us. Hans Rosling, the author of “Factfulness”, demonstrates in a playful, memorable manner how only through a better understanding of data can we have a better understanding of the world we inhabit. Sounds simple? You think you already have a firm grasp of the world around you? We challenge you to take his simple test.

Chances are you did worse than a chimpanzee. You are not alone. Furious with the ignorance he encountered at such esteemed places as Harvard University and the World Economic Forum, Hans set off on a journey to help the world get to know itself. The first step in this journey was his data visualization project, Gapminder. The second step is his masterpiece: Factfulness.

Factfulness is a book about the world around us and about how we can use widely available data and heuristics to aid our understanding of it. For a taste of the book, feel free to take a stroll down Dollarstreet or watch one of his TED talks.

Factfulness centers around ten instincts that make us perceive the world around us dramatically differently than if we were to zoom in on widely available data. Being mindful of our hard-wired biases lets us calibrate intuition with critical thinking, with the end goal of making better decisions. I’ll mention all the biases, and expand on my favourites in more detail:

1. The gap instinct – Humans have a basic urge to divide things into two distinct groups, with nothing but an empty gap in between. We often hear about developing and developed countries. This intuitively makes sense, but the data tells us a different story. Developing and developed countries were a roughly correct approximation of reality in the 1960s; the world has completely changed since. Newspaper headlines often shape our view of the world more than they deserve to.

People love to dichotomize. Dividing into two distinct sides is simple and intuitive, with some drama on the side. We dichotomize without thinking, and we do it all the time. In most cases there is no clear separation between the two groups. Even if averages show two distinct groups, the underlying reality is often vastly different. By digging deeper into the data we understand spreads and distributions, and we often realize that what we perceived as distinct groups have very much in common.

2. The negativity instinct – Humans are hard-wired to perceive negative stimuli as much more important than positive ones. This is why your brain reacts more intensely to a headline about terrorist attacks while ignoring the hundreds of thousands of lives saved daily by minor improvements in medicine.

3. The straight line instinct – We often assume that a trend will continue along a straight line without knowing what the underlying drivers of the trend are. Many trends do not follow straight lines; they are S-bends, slides, humps, or doubling lines. “No child ever kept up the rate of growth it achieved in its first six months, and no parents would expect it to.”

4. The fear instinct – Our survival mechanism makes us overestimate risks around us.

5. The size instinct – A favourite of mine and the most useful in practice. A lone number often seems impressive, but on its own it is almost always irrelevant. Ask: compared to what? We usually see absolute amounts thrown around because they are easier to find, but rates are often more meaningful.

6. The generalization instinct – Humans unconsciously categorize and generalize all the time. Categories are necessary for us to function; they give structure to our thought. Although a very useful tool for explaining the world around us, categories can be dangerous and misleading if not used correctly.

Wrong generalizations are mind-blockers for all kinds of understanding. If someone offers you a generalization based on a single example and wants you to draw conclusions about a more general topic, you should ask for more examples. Are you happy to conclude that all chemicals are unsafe on the basis of one unsafe chemical? Would you be prepared to conclude that all chemicals are safe on the basis of one safe chemical? Hans gives us a few tricks to control this instinct: look for differences within groups, look for similarities across groups, look for differences across groups, beware of “the majority” (it could mean 51%, but also 99%), beware of vivid examples, and assume people are not idiots.

7. The destiny instinct – relates to the idea that innate characteristics determine the destiny of everything around us. The instinct persists because truly relevant change happens too slowly to notice. Control it by constantly updating your data-based knowledge. You can also talk to your grandpa and remind yourself how small changes add up to huge ones over decades.

8. The single perspective instinct – is the idea that all problems have a single cause and a single solution. This kind of thinking saves us a lot of time. We like to feel knowledgeable. We like to feel useful. We like to form instant opinions about any topic raised; we get no satisfaction from holding opinions only on the few topics we know we are right about. Control this instinct by constantly testing your ideas for blind spots. Be curious about new information that doesn’t fit your view. Seek mental models from other fields. Rather than talking to people who agree with you, seek out people who contradict you.

Through his first-hand experience with experts and consultants from a wide range of professions, Hans tells us that, unless they are operating within a very narrow circle of their competence, they are unlikely to be helpful. Experts from one domain rarely know their limits and often, with a remarkable degree of certainty, extrapolate their knowledge of one area into another. This is especially important when using data to make decisions: expert knowledge, complemented and better informed by data, is the key to making the right decisions.

9. The blame instinct – is our urge to find a clear, simple reason for why something bad has happened. It seems to come naturally to us to decide that when things go wrong, it must be because a bad individual or group wanted them to go wrong. We need to believe that individuals have power; otherwise the world feels unpredictable, confusing and frightening. This instinct steals our focus and blocks our learning. To control it, look for causes, not villains. The same goes when an individual claims to have caused something good – ask yourself whether the outcome might have happened anyway.

10. The urgency instinct – “Now or never! Learn Factfulness now! Tomorrow may be too late!” This is what salespeople and activists love, and how they exploit us. Things are almost never that urgent. Yet we have to admit that this instinct has served us well in the past: those who stopped in front of the lion to carefully analyze the probabilities are not our ancestors. If something is urgent and important, it should be measured. Beware of data that is relevant but inaccurate, or accurate but irrelevant. Only relevant and accurate data is useful, and therefore it is crucial to protect its integrity and the credibility of those who produce it.


Professor Rosling used to finish his lectures by swallowing a sword. It was a message for non-believers: they could see with their own eyes how their intuitive assumptions (a 70-year-old man cannot swallow a sword!) crumble.

I would have been thrilled to attend one of his classes. Please pass his wisdom along.


How Big Data Won the Hearts of Telecommunications

The deep connection between telecommunications and Big Data is very clear – the main task of telecommunications is exchanging data. And since the amount of data has increased enormously in the modern era, experts in telecommunications companies have needed some help from specialists.
The need for experts who “appear out of the blue and solve the matter” is not unfamiliar. A good physician does a thorough check of the patient, collects data through different tests, sets up a preliminary diagnosis, prepares everything for surgery and draws the path to full recovery. But for the surgery itself, he calls in an expert in that area to assist in the procedure. That is the only way to ensure that all the collected test results, findings and x-rays – i.e. data – are used in the most efficient way, and that the patient gets the greatest benefit from them. “The specialist” sees things in the data that the experts for the other parts of the process cannot identify. The telecommunications giant Vip Mobile called in Things Solver as “the specialists” for Big Data and analytics – the task was to get the most out of the data they collect.

A Few Months to the First Results

Vip Mobile Software Architect Goran Pavlovic says that the specialists’ help was most needed in the analysis of key business operations. The tasks were to analyse network capacity, analyse interactions among employees during the incident and complaint resolution process, and analyse user interactions in the Web Shop.

Vip Mobile Engineer Djordje Begenisic was involved in the network capacity analysis. He remembers that the task was to build a tool to help predict and keep track of network performance and user experience. “It was all focused on the timely detection of capacity issues, and also on the identification of the users who are the least satisfied with the services. With that knowledge and a proactive approach, we could notably decrease the level of user dissatisfaction”, Begenisic claims.
The first results did not take long. “In just a couple of months, we managed to notably increase the precision of our predictions. That also removed all doubts about the power of Data Scientists”, Begenisic concludes.
The Vip engineer also adds that spending time with data experts ready to dig deep was the added value of the whole process. At the same time, it took a joint effort to turn it into a success story.

The Road from Excitement to Results

“Our cooperation begins with the initial excitement while you are presenting the problem and the options for solving it”, is how Begenisic describes the joint efforts. “The Data Scientist listens carefully, and then draws a conclusion – one which you probably do not even understand.”
“Each side has key knowledge the other needs in order to create a successful product – the telecommunications experts have the domain knowledge, while the Data Scientists bring in knowledge of data processing and machine learning”, Software Architect Pavlovic says. Djordje Begenisic reveals an interesting aspect of the story: “It is a great advantage even if the Data Scientist is not fully into the domain knowledge, because there is a chance to analyse the problem fully, without the disturbance that incomplete domain knowledge can bring into the process.”

Brainstorming remains the key part of the multidisciplinary process. A joint search for ideas is the key to success, and the result is also part of that joint effort – data science “practises” and perfects its own methods on the telco’s big data, while the telco experts get a chance to develop new skills in modern technologies by working with Data Scientists. It is a true win-win story.

Why the Telco Industry Seems Destined for Big Data

Man has been defined in numerous scientific ways, but one definition seems unchanged since the beginning of time: man is the creature that communicates – from the first attempt to speak to the last tale told to grandchildren.
“Telecommunications is defined as the exchange of information between a source and a destination, with the use of technology. The means of transmission vary and are getting more complicated as communication becomes multi-dimensional. It all started with smoke signals, considered to be the first ‘digital’ communication ever”, is how Vip Mobile engineer Djordje Begenisic describes the industry he works in.
But since the smoke signals, reality has changed significantly for telecommunications. Smoke signals are easy to count and understand; now things have become more complicated. “In this industry, we have a lot of meeting points with our user and we can collect information in real time. We get that data through the sales process (CRM systems, Webshop, self-care systems), through the network (every use of a phone or any other device that communicates with the network leaves a trace), and through the post-sales process and activities (bill payment, purchase of additional options, credit charge)”, and it seems like Vip Mobile software architect Goran Pavlovic could go on listing examples forever.
But it took some time for telecommunications companies to expand their interest to Big Data. For a while, they used it only for marketing needs and financial analysis. Companies like Things Solver inspired them toward new discoveries and further thinking.

How to Swim in the “Data Lake”

“Telecommunication companies are a bit like dinosaurs – we need energy injections and impetus to direct us towards new trends, in order to improve our competitiveness and perform the digital transformation the right way”, says Begenisic. He compares the amount of data the industry possesses to a lake – and for a lake, you need good swimming skills.

To keep the dinosaurs from turning into “Loch Ness Monsters”, data scientists provide the “life vests”. Data Science helps with the optimization of network performance, better capacity allocation and planning, the optimization of internal processes, more efficient and faster reactions to user-reported issues, and foreseeing future user problems and reacting to them in advance.

Pavlovic explains that the joint efforts of Vip Mobile and Things Solver started with solving the “smaller” problems first. “Led by the ‘start small & fail fast’ motto, we managed to bring several solutions into commercial use in the two companies, while in some cases we saw very early that the results would not be as expected, so we reallocated our resources to other projects”, he concludes.

The “2017 Big Data Analytics Market Study” by Dresner Advisory Services, reported by Forbes, found that the use of Big Data has reached 53% of companies in the US – the magic has reached more than half of the business world. The figure is even more stunning when compared to the 17% of companies that were in the Big Data and analytics world back in 2015. It is not hard to guess that the top of the list is reserved for telecommunications and financial services companies: the study showed their need to get their data in order is the greatest. As the main reasons for entering the analytics world, companies list user behavior analysis and social analysis, which should result in a more predictable business environment.
The Serbian telco market is no different. “The skills of data management, and a proper understanding of the data’s content within the legal limits, are what make the difference between two telecommunications companies. Today everyone has 4G networks, 5G is coming to everyone in a few years, the differences in speed and ping are decreasing… So we need to find a new field to compete on in the market”, Djordje Begenisic concludes. And it’s not just about finding another field – it’s also about winning the game, this time with knowledge that adds extra value to the already existing data.

Interactive log analysis with Apache Spark

The Internet is becoming the largest global shop across markets, and anyone offering products and services of any kind prefers web shops to become their primary outlet for supplying customers. This leads to a reduction in the number of employees and traditional brick-and-mortar branches, and to a reduction in costs, so it is clear that customer behavior analysis on digital and online channels is of great importance. For this reason it should not be surprising that many companies treat this kind of analysis as a basic need.

In this post I will not focus that much on Spark itself, since the Apache community has excellent documentation. It is enough to mention that Apache Spark is the most common Big Data tool for processing large amounts of data, with rich APIs for machine learning, streaming data, graph analysis, etc. The main feature of Spark is that the processing is performed in-memory, which makes it extremely fast – up to 100 times faster than traditional MapReduce – and that is why it is recognized as the most valuable tool in cases like this one, when dealing with hundreds of gigabytes of data.

Spark provides APIs in Scala, Java, R, SQL and Python. In this post we are going to use the last one, called PySpark. In terms of data structures, Spark supports three types: RDDs, Datasets and DataFrames. Datasets and DataFrames are built on top of the Spark SQL engine, which is why they handle data more efficiently than RDDs.

Now we can move to the subject of this post. The main focus will be some basic methods for parsing website logs in PySpark, with a little help from SQL for making life easier, and several pretty neat Python libraries specifically designed for this purpose.

The first step after the library imports is creating a SparkSession and determining the path to the input file. As input I will be using synthetically generated Apache web server logs, and a Jupyter Notebook for interactive analysis.

Let’s start building our Spark application. The first step is to create a SparkSession object, which is the entry point for a Spark application:

[code language="python"]
import re           # used by the log-parsing regex below
import tldextract   # extracts domains from referer URLs
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import *

# SparkSession is the entry point for the application
spark = SparkSession.builder.appName('Logs Streaming Parser').getOrCreate()

# Read the raw log file as an RDD of lines
mydata = spark.sparkContext.textFile('/Users/sinisajovic/Downloads/logs.log')
[/code]

Data is rarely 100% well formatted, so I would suggest applying a function that filters out missing or incorrectly exported log lines. In our dataset, an incorrect log line starts with ‘#’ or ‘-’, and the only thing we need to do is skip those lines.
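A minimal sketch of that filtering step, assuming the mydata RDD created above:

[code language="python"]
# Keep only non-empty lines that do not start with '#' or '-'
cleandata = mydata.filter(lambda line: line and not line.startswith(('#', '-')))
[/code]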

Here is a sample Apache server log line:

[code language="python"]
127.0.0.1.800.00, 127.0.0.1.800.00 - - [08/Feb/2017:16:33:27 +0100] "GET /api/house/get_for_compare?idn=33&code=99992&type= HTTP/1.1" 404 19636 "mywebsitelocation.com" "Mozilla/5.0 (Linux; Android 5.0.1; GT-I9505 Build/LRX22C) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36"
[/code]

To split the log lines into useful columns we are going to write a regex. Pythex.org is a good place to test whether the regex suits our needs.

In my case the regex looks like this, stored as LOG_PATTERN so the parsing function below can reference it:

[code language="python"]
LOG_PATTERN = r'(.+?) - - \[(.+?)\] "(.+?) (.+?) (.+?)/(.+?)" (.+?) (.+?) "(.+?)" "(.+?)"'
[/code]
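As a quick sanity check, here is a sketch that applies the pattern to the sample line above (shortened slightly for readability):

[code language="python"]
sample = ('127.0.0.1.800.00, 127.0.0.1.800.00 - - [08/Feb/2017:16:33:27 +0100] '
          '"GET /api/house/get_for_compare?idn=33&code=99992&type= HTTP/1.1" 404 19636 '
          '"mywebsitelocation.com" "Mozilla/5.0 (Linux; Android 5.0.1; GT-I9505 Build/LRX22C)"')

match = re.search(LOG_PATTERN, sample)
# prints: GET 404 mywebsitelocation.com
print(match.group(3), match.group(7), match.group(9))
[/code]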

We are going to use the user_agents library, since it is pretty good at identifying primary device characteristics, such as brand name, model, type, the operating system that runs on the device, the browser used to surf, and so on. We can also check whether it is a touch device, or even whether it is a bot…

[code language="python"]
from user_agents import parse

def device_type(user_agent):
    # Map a parsed user agent to a coarse device category
    options = {1: 'mobile', 2: 'tablet', 3: 'pc', 4: 'touch', 5: 'bot'}
    if user_agent.is_mobile:
        ty = 1
    elif user_agent.is_tablet:
        ty = 2
    elif user_agent.is_pc:
        ty = 3
    elif user_agent.is_touch_capable:
        ty = 4
    elif user_agent.is_bot:
        ty = 5
    else:
        ty = 6
    return options.get(ty, 'unknown')
[/code]
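For example, calling it on a hypothetical Android user-agent string:

[code language="python"]
ua = parse('Mozilla/5.0 (Linux; Android 5.0.1; GT-I9505) AppleWebKit/537.36 '
           '(KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36')
print(device_type(ua))  # prints: mobile
[/code]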

The main part of our script is the parse_line function, which extracts the defined fields using our regex pattern (LOG_PATTERN) together with the user_agents and tldextract libraries. As output we return a Row of columns built from the matched groups and the parsed user agent string.

[code language="python"]
def parse_line(logline):
    # Guard against empty or missing lines
    if logline is None or len(logline) < 1:
        return 'Other'
    try:
        match = re.search(LOG_PATTERN, logline)
        user_agent = parse(match.group(10))
        # Extract the registered domain from the referer field
        domain = tldextract.extract(match.group(9)).domain

        return Row(
            IP_protocol=match.group(1),
            Timestamp=match.group(2),
            Request_type=match.group(3),
            Request=match.group(4),
            Type=match.group(5),
            Version=match.group(6),
            Status=match.group(7),
            Size_of_response=match.group(8),
            Referer=match.group(9),
            Browser=user_agent.browser.family,
            Device=user_agent.device.family,
            OS=user_agent.os.family,
            Mobile=user_agent.is_mobile,
            Tablet=user_agent.is_tablet,
            Touch=user_agent.is_touch_capable,
            PC=user_agent.is_pc,
            Bot=user_agent.is_bot,
            Domain=domain
        )
    except Exception:
        # Lines that do not match the pattern are skipped
        return None
[/code]

The last matched group in this case is the domain, parsed with the tldextract library, which takes a URL and extracts the domain information from it. After we are finished processing with the libraries I have introduced, we need to create the schema. This is nothing more than creating a table-like structure, matching fields to the previous functions and assigning the column names. Here are a few lines of code showing how to create a schema for our dataset:

[code language="python"]
schema = StructType([
    StructField('Browser', StringType(), True),
    StructField('Device', StringType(), True),
    StructField('OS', StringType(), True),
    # ... the remaining Row fields are omitted here for brevity ...
    StructField('PC', StringType(), True),
    StructField('Bot', StringType(), True)
])
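# 'log_parsed' is not defined in the original post; a minimal, hypothetical glue
# step is to map parse_line over the cleaned RDD and keep successfully parsed rows:
log_parsed = cleandata.map(parse_line).filter(lambda r: isinstance(r, Row))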

logs = spark.createDataFrame(data=log_parsed, schema=schema, samplingRatio=None)
[/code]

This may be too many code samples, but I find it much more efficient and helpful for understanding the basic logic of our approach. Since we want to use the SQL capabilities, let’s create a temp view.

[code language="python"]
# Register the DataFrame as a temporary view so we can query it with SQL
logs.createOrReplaceTempView('logsdata')

devicedf = spark.sql('select Device, count(Device) from logsdata group by Device')
browserdf = spark.sql('select Browser, count(Browser) from logsdata group by Browser')
[/code]

With this implementation we can write SQL queries, with huge benefits for filtering and sorting data. A very important thing to do is to filter out poorly parsed lines. For further processing, a good approach is to create a DataFrame for each specific category: select the category, then count and group by. We could do this in pandas too, but SQL gives us much better performance. This is the last step of our tutorial before moving to one of the most popular Python libraries for handling structured datasets, known as pandas. To simplify the usage of the matched groups, we create a pandas DataFrame for each of them.
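As a sketch of that conversion, assuming the aggregated Spark DataFrames created above (the _df names match the ones used in the plotting code below; an analogous group-by-OS query would produce operatingsystems_df):

[code language="python"]
# Collect the small aggregated Spark results into pandas DataFrames for plotting
device_df = devicedf.toPandas()
browser_df = browserdf.toPandas()
[/code]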

For plotting we’re going to use the matplotlib library.

[code language="python"]
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Bar chart of device types
devicetype = device_df.plot.bar(x=['Mobile'], color=('firebrick', 'steelblue', 'darkgrey'),
                                figsize=(8, 8), title='Devices', width=0.5, fontsize=15)
devicetype.set_xlabel('Device type')
plt.figure()

# Bar chart of operating systems
operatingsystems_bar = operatingsystems_df.plot.bar('OS', color='red', figsize=(8, 8),
                                                    yticks=operatingsystems_df['count(OS)'],
                                                    title='Operating Systems', width=0.5, fontsize=15)
plt.show()
[/code]

On the first plot we can display the device information – whether it is a tablet, touch, mobile, PC or some other type.
[Figure: device type bar chart]
We can also dig into the operating systems most commonly used on devices. There are many plot types supported, and they are pretty well documented in the official library documentation.
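For instance, a pie chart of the operating system counts is a one-liner – a sketch here, assuming the operatingsystems_df pandas frame and its count(OS) column used above:

[code language="python"]
# Hypothetical example: share of operating systems as a pie chart
operatingsystems_df.set_index('OS')['count(OS)'].plot.pie(figsize=(8, 8), title='Operating Systems')
plt.show()
[/code]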
[Figure: operating systems bar chart]

Let’s see which mobile phones and browsers are the most represented among our users.
[Figure: browsers and phones bar charts]

On this plot we see that iPhone is in first place. The reason lies in the library itself – unlike for other brands, it does not recognize the exact iPhone model.
We can also see a pretty tremendous domination by Samsung in the budget phone market, and by Chrome in the browser sphere (at least in Serbia, based on our sample).
It is also evident how essential and decisive Facebook marketing is, since the Facebook in-app browser is in third place.

This was just a brief introduction to exploratory log analysis with Python, pandas and matplotlib. Pandas gives us the capability to do much deeper analysis using DataFrames, which will be the main focus of my future posts. We will also move on to more advanced analyses, such as mapping user behavior on a website, user segmentation, and recommender systems.