Friday talks: A Data Science Project

This post is not going to be about another Data Science course you should enroll in. It’s not going to be about the various skills you need to build in order to develop a Data Science project, either. The title of this post, A Data Science Project, is meant as a pun. Your journey to the destination called “I am a Data Scientist” is itself a project you should be working on, with phases, iterations, and mismatches between the requirements and the actual outcomes. I would like to talk about my Data Science path, and what it is like to be a Data Scientist from my perspective. I can assure you that there are tons of blog posts on the web covering the same topic, backed by more information and experience than my own. Still, I want to talk about heading this way and share some unconventional pointers that made my journey a lot easier and, hopefully, will do the same for you.

So, regarding the beginnings, there are some baby steps you should take in order to build the basic skills needed for analysis and extracting insights. And you do that really well. The beginnings are no longer a problem. Most of you start with the Machine Learning course taught by the incarnation of a deity in the world of machine learning, Andrew Ng. Or with DataCamp. Or on Kaggle. And that’s the right way to do it. But there are some additional activities you can practice that will make it a lot easier to master this field, enrich your experience, and broaden your skill set.

1. Research & Blogs

Being a Data Scientist requires lots of research. In order to get the most out of the data, you should be aware of the limits. And the limits are constantly changing. How do you know where the limit is? By doing some serious research! Follow what academia is doing, but also how those ideas are being implemented in industry. Besides academic research and scientific papers on a particular subject, I read lots of blogs every day. Some of the blogs that I personally like are Analytics Vidhya, Towards Data Science, Machine Learning Mastery, Brandon Rohrer’s blog, and Colah’s blog. Or, you can install Flipboard, set your topics of interest, and follow along.

2. Meetups & Conferences

Communities and gatherings are precious in this field. Lots of enthusiasts and experienced people can be found at such events, sharing their knowledge and findings. At Things Solver, we really believe in the “sharing is caring” motto, and with that in mind, we try to share our knowledge and let it grow even more through these events. There are many meetups in Serbia on Data Science, AI, and related topics, so you can start by exploring Meetup.com and the areas you’re interested in. The most popular Data Science community is Data Science Serbia, which organizes meetups and encourages bonding and networking among Data Science enthusiasts. As for conferences in Serbia, the most popular one is certainly the Data Science Conference, which grows bigger each year.

3. Social networks & Influencers

Social networks are a good way to follow activities and events, even when you’re not able to be there physically. The one I use every day is LinkedIn. There are some inspiring people that I follow and learn from, like Favio Vázquez, Brandon Rohrer, Jason Brownlee, Andriy Burkov and many, many more.

People often underestimate these things, but they really are a crucial part of a continuous Data Science path. And that is one of the biggest problems one encounters at the beginning. Like every other field, it requires dedication, research and lots of learning. And, since the field is continuously growing, one should grow alongside it in order to stay at the top, comfortable with cutting-edge technologies. And, to be honest, that is not easy.

The wanderer’s puzzle

The first thing I want to discuss is something I call “the wanderer’s puzzle”. And I want to open this section with Tolkien’s words: “Not all those who wander are lost…”. So, entering this field (or any other field), you’re probably feeling lost. But what’s the right way to do Data Science? It depends. There is no such thing as a recipe with perfectly measured doses of ingredients. The first thing is to wander. To find yourself. And I have a really interesting story to share with you, called The Hedgehog and the Fox. My dear colleague Anđela shared this story with me, and it really helped me find myself. I drew an analogy between this story and Data Science. You have to determine whether you’re a hedgehog or a fox. It depends on your interests. You can either be a hedgehog, focused on mastering one thing, or a fox, moving through various domains at the same time. In Data Science, I know many colleagues who are total hedgehogs (they are experts in computer vision, for example, but they have never heard of Isolation Forest or a survival curve). And similarly, I have lots of colleagues who are foxes: they have played with CNNs, time series analysis, and store optimization in various domains like marketing, finance, etc., but they always say they haven’t yet dug deeper into any of these areas.

The imposter syndrome

Another thing that I would like to talk about is confidence. Reading lots of blogs and listening to many technical courses and presentations, I had a really hard time believing in myself and building confidence and self-awareness. I never thought about the real problem I was facing: the imposter syndrome. So, the imposter syndrome… This is a situation where you’re doubting yourself, your competences and knowledge, afraid of being exposed or flagged as a “fraud”. This is a frequent problem, and lots of successful people are facing it. You know that there will always be someone with more experience, more knowledge, better competences. That’s not the problem. The problem is that you think you’re not good enough. That your achievements are not yours, but the merit of someone else, or a lucky series of circumstances. And that it’s only a matter of time before someone breaks you and ruins your career and everything you’ve accomplished. I was lucky to have a conversation with a more experienced Data Scientist, who pointed this problem out to me. And I have a perfect read on this topic here. So, stop doubting yourself and keep rocking Data Science!

Development vs. production

When looking at practice and real-world applications, there are also some key things you should be aware of in order to stay on track and save your stamina. And that’s not something that you can easily learn or hear about just anywhere. Dealing with real-world Data Science projects, I have learned one crucial thing. You should never (like, EVER!) look at Data Science project development and production as two separate things. They are done in separate phases, they can be done by separate teams, and they can even be separated by the environments and conditions they run in. But they should always be regarded as a single whole. Now, I know that you’re asking yourself why you would ever separate those things. Yet again, I am sharing my experience, and yes, I made this mistake. And learned from it.

Each Data Science project starts with a problem that should be solved. The solution should lead to business improvements, reflected in a revenue increase, a cost reduction, or whatever the desired metric is. There are several phases in the development process, as well as in deployment, and this flow is usually divided between several teams. With so many phases and iterations, lots of things can happen, potentially leading to complications and project failures. It goes without saying that everyone involved should be completely dedicated and aware for this process to run well. So, is there anything that you can do (or avoid doing), as a Data Scientist, to make this process as smooth as possible? Yes, for sure! In many cases, Data Scientists are described as lazy and messy. Why is that? We develop our models and test them in environments that are not even IDEs, but some kind of browser tab! We love the interactivity and line-by-line execution! And that really comes in handy during the development phase, when playing with the data and different models. We have a pretty narrow focus on finding the right model, putting everything else (like data retrieval, code modularity, results delivery, etc.) aside. The problem appears once you’ve chosen a satisfying model. You cannot just throw it to the teammate assigned to the deployment like it is a hot potato! And that’s the biggest issue in every project. In most cases, especially when you are a rookie, models are not production-ready. And that can lead to lots of headaches. Data Scientists often neglect the steps that come after the model training phase. And that is pretty irresponsible and not aligned with the team spirit you should have! You have to think about production and model deployment. You have to communicate with the ones responsible for the model deployment. And, if you are also the one deploying the model into production, you should be responsible to yourself, too.

My most sincere recommendation is to always think about the whole process. The things you should always take into consideration are model scalability, generalization, adaptation, optimization, and additional tweaking. Write code that is readable and easy to upgrade. Parameterize everything that is prone to change. Develop models that can easily be enriched with more data. Create pipelines. And, even if you’re a researcher or a “lazy” Data Scientist in a team consisting of both Data engineers and Machine learning engineers, make sure that you at least understand the whole process. You’re not an independent entity in the project. The process will be much faster and more efficient if you take these things into account from the beginning, not to mention the benefits for the project flow and success rate. Finally, you are a Data Scientist. It’s not only about 95% accuracy. It is about the impact of the whole process. You have to understand why you’re doing it. But also how it is changing the environment you’re in. And that is much more satisfying than 95% accuracy, to be honest. If a model with 68% accuracy is driving changes and creating business value, I’m totally up for that!
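To make “parameterize everything” and “create pipelines” a bit more concrete, here is a minimal sketch in plain Python. All the names in it (Config, fill_missing, scale, run_pipeline) are made up for illustration and do not come from any real project or library. It only shows the idea: the values that are likely to change live in one config object, and the steps are small functions chained in a fixed order, so the same code can run in development and in production with different settings.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# NOTE: every name here is hypothetical, chosen only for this illustration.


@dataclass(frozen=True)
class Config:
    """Everything that is prone to change lives here, not inside the steps."""
    missing_value: float = 0.0  # value used in place of missing entries (None)
    scale_factor: float = 1.0   # multiplier applied to every value


# A step takes the current values and the config, and returns new values.
Step = Callable[[List[Optional[float]], Config], List[float]]


def fill_missing(values: List[Optional[float]], config: Config) -> List[float]:
    """Replace None entries with the configured default."""
    return [config.missing_value if v is None else v for v in values]


def scale(values: List[float], config: Config) -> List[float]:
    """Multiply every value by the configured factor."""
    return [v * config.scale_factor for v in values]


def run_pipeline(values, steps: List[Step], config: Config):
    """Apply each step in order, passing the same config to all of them."""
    for step in steps:
        values = step(values, config)
    return values


STEPS = [fill_missing, scale]

# Same steps, different settings: nothing inside the functions has to change.
dev_result = run_pipeline([1.0, None, 3.0], STEPS, Config())
prod_result = run_pipeline(
    [1.0, None, 3.0], STEPS, Config(missing_value=2.0, scale_factor=10.0)
)
```

The point of keeping the settings in a single object is that the teammate who deploys the model can change them without reading through the logic, and adding a new step means adding one function to the list.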

There is one last thing I want to share with you. How do you continuously grow? The following are three very simple, but powerful steps I stumbled upon while browsing the net (check out the whole post here, it really is valuable).

1. Identify your weaknesses
2. Define a plan that turns your weaknesses into strengths
3. Execute the plan

Simple as that, ain’t it? 🙂