Friday talks: The Holy Grail of Machine Learning


Pedro Domingos is a professor of computer science and engineering at the University of Washington and a winner of the SIGKDD Innovation Award, which, FYI, is the highest honor in data science. They say it takes roughly seven years to become an expert in a field. This man gave one of his first lectures on artificial intelligence back in 2001, so he has decades of research, hard work and engagement behind him. He is the author of the book The Master Algorithm. Since I like this book a lot, I would like to share my main impressions in this blog post. I know that we all worship Andrew, but as for me, I have found yet another god here on Earth. Pedro Domingos is a name that I failed to mention in my previous post, along with a whole bunch of other important names that totally slipped my mind. But fortunately, I have a chance to make it right. 🙂 If you liked the story about the hedgehog and the fox, stay tuned, because it’s going to get even more interesting! 😉

The five tribes of machine learning

In his book, Pedro talks about something he calls the machine learning tribes. What is it all about? I bet you can anticipate it. As in every other field, there are different streams you can follow, depending on your beliefs and preferences. Now this is the point where the whole discussion about generalists and specialists escalates. So, it’s not only about the simple partition into generalists and specialists, but also about the paradigms you pursue, which kinda makes things even more complicated. 😀 Or, we may call it, interesting. It’s like I always knew there were different mindsets and approaches, but couldn’t quite draw the exact line between them. Pedro lists the following basic directions: connectionists, evolutionaries, Bayesians, analogizers and symbolists. Before digging deeper into these terms, let’s make a general point. Each approach finds its origin in a different field of science and research. Each one of them has its own master algorithm, a general-purpose learner that can (or at least, it is believed it can) be used to learn everything. If you give it enough data. And compute power. And some more data. And although you may gravitate towards one of these approaches, you cannot tell for sure which one is the best or the most powerful. Each one of them has its fortes in specific problem domains. On the other hand, each one of them has its own shortcomings. Let’s take a walk through these “schools of thought” in machine learning.


The connectionists

This is maybe the most popular approach nowadays, since it includes neural networks and thus deep learning. Connectionists believe that problems need to be solved the way humans solve them, so they find their baseline in neuroscience, trying to emulate the human brain. To be precise, connectionists go nuts when someone says that neural networks emulate the human brain. Because they don’t. The human brain is much more complex and cannot be easily emulated, since there is much about it yet to be discovered. Neural networks only use the human brain as an inspiration, rather than a model to be blindly followed. They consist of computing units, neurons, that take inputs, calculate their weighted sum, pass the result through some activation function and feed that output forward to the neurons in the following layer. The general idea is simple: take this input, let the neurons perform magic, and return some output. But how do neural networks learn, how do they generate the output they should? By using backpropagation, which is regarded as their biggest advantage. The error is propagated back from the output to the previous layers of the network to tweak the weights, in order to make the error as small as possible. And they really are powerful and widely used: image recognition, cancer research, natural language processing, you name it. But there are still some shortcomings, like the amount of data required, or the lack of interpretability. There’s a long way to go, but connectionists believe theirs might one day become the all-mighty algorithm.
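To make the weighted-sum-plus-backpropagation recipe concrete, here is a minimal sketch: a single sigmoid neuron learning the OR function by propagating its output error back into its weights. The learning rate, epoch count and the OR task itself are my illustrative choices, not examples from the book.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Training data for logical OR: inputs and target outputs
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w = [0.0, 0.0]   # weights
b = 0.0          # bias
lr = 1.0         # learning rate (an arbitrary illustrative choice)

for _ in range(3000):
    for (x1, x2), target in data:
        y = sigmoid(w[0] * x1 + w[1] * x2 + b)   # forward pass
        err = y - target                          # error at the output
        # "Backpropagate": gradient of the squared error through the sigmoid
        grad = err * y * (1 - y)
        w[0] -= lr * grad * x1
        w[1] -= lr * grad * x2
        b    -= lr * grad

predictions = [round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for (x1, x2), _ in data]
print(predictions)
```

A real network stacks many such neurons in layers and propagates the error through all of them, but the weight-tweaking loop is the same idea.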


The evolutionaries

The evolutionaries go one step further than the connectionists. They claim that the evolutionary process is far more powerful than human reasoning, since, in the end, evolution is what drove reasoning to be what it is nowadays. They take their roots in evolutionary biology. Among the most famous evolutionaries: John Holland developed genetic algorithms, John Koza developed genetic programming, while Hod Lipson continued this line of work through evolutionary learning. How does a genetic algorithm work? At the beginning of the process, we have a population of individuals. In the centre of attention is a genome, which describes each individual (in computation, represented in bits). Each individual is evaluated for the specific task it should be solving, so that the best-fit individuals have a bigger chance to survive and be the parents of the next generation. In this process, the genomes of two individuals are combined to create a child genome, which contains a part of the genome from each parent. As in evolution, random mutation of genomes can happen, and so it does in evolutionary learning, so we practically get a new population in each generation. This process is repeated iteratively until we get a generation of best-fit individuals able to solve the problem optimally. While the connectionist approach only adjusts weights to fit a fixed structure, the evolutionary approach is able to learn the structure, to come up with the structure itself.
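The selection-crossover-mutation loop described above can be sketched on the classic "OneMax" toy problem: evolve a bit-string genome towards all ones. The population size, mutation rate and fitness function are invented for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 60

def fitness(genome):
    return sum(genome)  # count of 1-bits: higher is fitter

def crossover(a, b):
    point = random.randrange(1, GENOME_LEN)   # one-point crossover
    return a[:point] + b[point:]              # one part from each parent

def mutate(genome, rate=0.02):
    # Each bit flips with a small probability, as in random mutation
    return [bit ^ 1 if random.random() < rate else bit for bit in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # Fitter individuals get a bigger chance to parent the next generation
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print(fitness(best))  # close to GENOME_LEN after a few dozen generations
```

Genetic programming goes one step further still: the genome encodes a whole program tree rather than a fixed-length bit string.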


The Bayesians

Now, the Bayesians find their origins in statistics. Since everything is uncertain, they quantify the uncertainty by calculating probabilities. They bow to Bayes’ theorem. But how is their learning process conducted? At first, they define some hypotheses. Then they assign a prior probability to each hypothesis, meaning how much they believe the hypothesis is true before seeing any data. Pedro notes that this is the most controversial part of their learning process: how can you pre-assume something given no data? But as the evidence comes in, they update the probability of each hypothesis. They also measure the likelihood, which tells how probable the evidence is, given that the hypothesis is true. After that, they can calculate the posterior probability, which tells how probable the hypothesis is given the observed evidence. Thus, a hypothesis consistent with the data will have an increasing probability, and vice versa. The biggest forte of this approach is its ability to measure uncertainty. And that frequently is the problem: generally speaking, the new knowledge we generate is uncertain at first, and it is good to quantify that uncertainty. Maybe you’re not quite aware of it, but self-driving cars have Bayesian networks implemented inside, and use them for their learning process. Some of the most famous Bayesians are Judea Pearl, who developed the powerful Bayesian networks, David Heckerman and Michael Jordan. It is said that the Bayesians are the most fanatical of all the tribes, so I wouldn’t mess with them, to be sincere. 😉
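The prior-likelihood-posterior cycle above fits in a few lines of code. A toy question of my own: is a coin fair, or biased towards heads? The two hypotheses and the observed flips are made up for illustration.

```python
# P(heads | hypothesis) for each hypothesis
hypotheses = {"fair": 0.5, "biased": 0.8}

# Prior: belief in each hypothesis before seeing any evidence
posterior = {"fair": 0.5, "biased": 0.5}

evidence = ["H", "H", "T", "H", "H"]  # observed coin flips

for flip in evidence:
    # Likelihood: how probable this flip is under each hypothesis
    likelihood = {h: (p if flip == "H" else 1 - p)
                  for h, p in hypotheses.items()}
    # Bayes' theorem: posterior ∝ prior × likelihood, then normalize
    unnormalized = {h: posterior[h] * likelihood[h] for h in hypotheses}
    total = sum(unnormalized.values())
    posterior = {h: v / total for h, v in unnormalized.items()}

print(posterior)  # four heads in five flips shifts belief towards "biased"
```

The hypothesis consistent with the data ("biased") ends up with the higher probability, exactly the updating behaviour described above, and the leftover probability on "fair" is the quantified uncertainty.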


The analogizers

The basic idea brought by the analogizers is reasoning by analogy: transferring the solution of a known situation to a new situation we face. As you may infer, they find their origins in several areas of science, but mostly in psychology. Peter Hart is one of the most famous analogizers, since he worked a lot on the nearest neighbour algorithm. Vladimir Vapnik is the inventor of support vector machines, also known as kernel machines. Another analogizer, Douglas Hofstadter, has been working on the more sophisticated topics he presented in his books. Pedro says that in one of them, Douglas even spent five hundred pages just arguing that all intelligence is pure analogy. Now, maybe we need to reconsider who the fanatical ones are here. 😀 So, basically, what they claim is that we can expand our knowledge by investigating new phenomena through drawing analogies with phenomena we already know. The most famous application of analogy-based learning is recommender systems. And it really does make sense! If you want to determine what to suggest next to a customer, check what similar customers liked, and by analogy place similar offers in front of the given customer. The biggest competence of the analogizers is the ability to generalize from just a few examples. We often encounter unknown problems, and their approach of learning through similarity is a good problem solver in those cases. Simple as that, ain’t it?
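The recommender idea above can be sketched as a nearest-neighbour lookup: find the most similar user and suggest what they liked. The users, items, ratings and the crude similarity measure are all invented for illustration.

```python
# Made-up ratings: user -> {item: score out of 5}
ratings = {
    "ana":   {"dune": 5, "alien": 4, "heat": 1},
    "boris": {"dune": 4, "alien": 5, "blade": 5},
    "clara": {"heat": 5, "dune": 1, "speed": 4},
}

def similarity(a, b):
    # Crude analogy measure: count items both users rated 4 or higher
    common = set(ratings[a]) & set(ratings[b])
    return sum(1 for item in common
               if ratings[a][item] >= 4 and ratings[b][item] >= 4)

def recommend(user):
    # Nearest neighbour: the other user most similar to this one
    others = [u for u in ratings if u != user]
    neighbour = max(others, key=lambda u: similarity(user, u))
    # Suggest the neighbour's best-rated item the user hasn't seen yet
    unseen = {i: r for i, r in ratings[neighbour].items()
              if i not in ratings[user]}
    return max(unseen, key=unseen.get)

print(recommend("ana"))  # ana resembles boris, so she gets his favourite
```

Kernel machines are the sophisticated version of the same instinct: classify a new example by its similarity to a handful of remembered key examples.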


The symbolists

The symbolists find their origin in logic and philosophy. Their main purpose is filling in the gaps in existing knowledge. Learning is the induction of knowledge, and induction is basically inverted deduction. General rules are made of specific facts, so the process of induction goes from specific facts to general rules. But not only that. Based on some known rules, they induce new rules, combine them with the known facts, and raise questions that were never asked before, which leads to new rules and answers that enrich the knowledge more and more. Practically, they start the process with some basic knowledge, then formulate hypotheses, design experiments, carry them out and, given the results, refine the hypotheses or generate new ones. This is the closest to the way scientists generally approach research, so this approach is mostly used in the creation of robot scientists. One of Pedro’s most famous examples of inverse deduction is the molecular biology robot Eve, which found a new malaria drug by using inverse deduction. Some of the most prominent symbolists Pedro names are Steve Muggleton, Tom Mitchell and Ross Quinlan. The biggest advantage of this approach is the ability to compose knowledge in many different ways, using logic and inverse deduction.
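A tiny sketch of the "specific facts to general rules" direction: given facts and their observed consequences, induce the rule that fills the gap. The facts, the rule format and the induction heuristic are all invented for illustration; real inductive logic programming is far richer.

```python
# Known specific facts and observed consequences, as (property, individual) pairs
facts = {("human", "socrates"), ("human", "plato")}
observations = {("mortal", "socrates"), ("mortal", "plato")}

def induce(facts, observations):
    # Inverted deduction, toy version: if every individual with property P
    # is also observed to have property Q, propose the rule "P(x) -> Q(x)".
    rules = set()
    properties = {p for p, _ in facts}
    conclusions = {q for q, _ in observations}
    for p in properties:
        for q in conclusions:
            covered = {x for prop, x in facts if prop == p}
            if covered and all((q, x) in observations for x in covered):
                rules.add((p, q))
    return rules

print(induce(facts, observations))  # proposes the general rule human -> mortal
```

Deduction would take "humans are mortal" and "Socrates is human" and conclude "Socrates is mortal"; induction runs the inference backwards, recovering the general rule from the specific cases.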

To sum it up.

Approach         Problem                 Solution
Connectionists   Credit assignment       Backpropagation
Evolutionaries   Structure discovery     Genetic programming
Bayesians        Uncertainty             Probabilistic inference
Analogizers      Similarity              Kernel machines
Symbolists       Knowledge composition   Inverse deduction

One algorithm to rule them all…

What Pedro states is that, eventually, it’s not about the partition itself. It’s about finding the unique solution. The universal learner. The all-mighty model. The Master Algorithm. Each of these five schools has its own advantages. Can we take the biggest forte of each one of them, combine them, and get a master algorithm able to solve every machine learning problem we give it? Pedro states this is the goal. Is it too complicated? It doesn’t necessarily have to be.

He divides all these approaches into three main modules: representation, evaluation and optimization. The representation tells us how the learner represents what it learns, and in most cases this will be expressed in common-sense logic (but it could be differential equations, polynomial functions, or whatever). In the book, the unification of this part comes from combining the symbolist and Bayesian approaches: since symbolists use logic, while Bayesians use graphical models, their combination can represent any type of problem one can encounter. The evaluation is used for measuring the performance of the model (in terms of pattern recognition, data fitness, generalization, etc.). The master algorithm should be able to take any of the evaluation functions available in these five approaches, and let the user decide which one will be used. The optimization is the task of finding the best-fit model for a given problem. This includes discovering formulas that mathematically describe the given problem, using the evolutionaries’ genetic programming approach, and optimizing the weights in those formulas, which can be solved with the backpropagation algorithm used by the connectionists. Pedro believes that the master algorithm will be able to solve any problem it is given. The most practical examples include home robots, cancer cure(s), 360-view recommender systems, etc. The list goes on and on.

I would like to close this post with his inspiring words:

Scientists make theories and engineers make devices. Computer scientists make both theories and devices.

Inspiring, ain’t it? The book is pretty concisely written, and can be read in one breath! Please accept this book as my warmest recommendation for a good read, and just enjoy it. Hopefully it will be as inspiring to you as it was for me. Share your impressions, I can’t wait to hear them! :3