Hello Docker

Having spent a couple of weeks on data preparation and on developing that particular machine learning model, you are finally ready to show off some really good results to your boss. You have your notebooks with lines of code doing magic, maybe some reports in Excel, amazing visualizations in Plotly, etc. It's 5 minutes until your presentation in front of the business stakeholders and you are feeling pretty confident in your work. You deliver the presentation, everything goes well and your team lead gives you that congratulatory pat on the shoulder with the words: "You should put this into production." If you are a novice in the field, you may be a little worried and thinking: "How am I supposed to make this work smoothly in production?" Don't be! 🙂

There is something you can use to really finish your project in style, and that something is called Docker. Chances are you have already seen it mentioned in the news feed of your favourite data science blogs and Facebook pages, or used in tech tutorials. Yeah, that little blue whale on the icon. So, let's get started with Docker and deploy your model in no time.

Firstly, Docker is a tool that allows developers, sys-admins and others to easily deploy their applications in something called containers, which run on the host operating system. The main benefit of using Docker is that it allows users to package an application with all of its dependencies into a standardized unit (the container mentioned above) for software development. Unlike virtual machines, containers use far fewer computing resources; they are fast and lightweight. The key distinction between containers and virtual machines is that containers share the host system's kernel with other containers.

As you can see from the scheme above, there are three main components of the Docker architecture: the Docker Client, the Docker Host and the Docker Registry. In this part we will define the key terms you need to understand in order to deploy your machine learning model as part of an application. So, let's begin!

Images – An image is a read-only template with instructions for creating a Docker container. Images are the blueprints of your application and form the basis of containers. We pull an image from Docker Hub with the docker pull command; this downloads the image, which is then used to create containers.
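
For example, you could pull an official Python image and then check what you have locally (the image name and tag here are only for illustration):

$ docker pull python:3.7
$ docker images    # list the images available on your machine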

To build your own image, you create a Dockerfile with a simple syntax for defining the steps needed to create the image and run it. Each instruction in a Dockerfile creates a layer in the image. When you change the Dockerfile and rebuild the image, only those layers which have changed are rebuilt. This is part of what makes images so lightweight, small, and fast compared to other virtualization technologies.
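
If you are curious about those layers, the docker history command lists them for any image you have pulled or built, e.g.:

$ docker history hello-world    # one row per layer, together with the instruction that created it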

Volumes – These are the data part of a container, initialized when a container is created. Volumes allow you to persist and share a container's data. They can be created explicitly using the docker volume create command, or Docker can create a volume during container or service creation.
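
A minimal sketch of both approaches (the volume name my_data is arbitrary):

$ docker volume create my_data                   # create a named volume explicitly
$ docker volume ls                               # list existing volumes
$ docker run -v my_data:/data ubuntu ls /data    # mount the volume into a container at /data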

Containers – These are created from Docker images and run the actual application. A container is created using the docker run command. If you type docker ps you will get a list of running containers, and if you need more information, docker ps -a gives you a list of all containers, including those that have been stopped. A container runs natively on Linux or Windows and shares the kernel of the host machine with other containers. It runs as a discrete process, taking no more memory than any other executable, which makes it lightweight.
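
The typical container lifecycle therefore looks something like this (the container name my_container is arbitrary):

$ docker run -d --name my_container ubuntu sleep 60    # create and start a container in the background
$ docker ps                                            # list running containers
$ docker ps -a                                         # list all containers, including stopped ones
$ docker stop my_container                             # stop the running container
$ docker rm my_container                               # remove it when it is no longer needed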

By contrast, a virtual machine (VM) runs a full-blown “guest” operating system with virtual access to host resources through a hypervisor. In general, VMs provide an environment with more resources than most applications need.

Docker Daemon – This is the background service running on the host that manages building, running and distributing Docker containers. The daemon is what actually executes the commands it receives from the Docker client.

Docker Client – This is a command-line tool that allows users to interact with the daemon. Think of it as the user interface for Docker.
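
You can actually see the two talking to each other: docker version prints one section for the client and one for the server (the daemon), and docker info shows more details reported by the daemon:

$ docker version    # Client and Server (daemon) versions
$ docker info       # detailed information about the daemon, images and containers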

Docker Hub – This is a registry of Docker images. You can think of it as a directory of all available Docker images. You can use either one of the officially maintained images or user-contributed images. Chances are that there is already an image that suits your needs; all you have to do is go to the Docker Hub website and do a quick search of the image database.
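
You can also search the registry directly from the command line, for example:

$ docker search python    # search Docker Hub for images related to python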

Dockerfile – This is a simple text file that contains a list of commands the Docker client calls while creating an image. It is a simple way to automate the image creation process. Once you have your Dockerfile set up, you can use the docker build command to build the actual image. The Dockerfile defines what goes on in the environment inside your container.
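
For instance, with a Dockerfile in the current directory, building and running an image could look like this (the tag my_app is just a placeholder):

$ docker build -t my_app .    # build an image from the Dockerfile in the current directory
$ docker run my_app           # start a container from the freshly built image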


Let's go through a practical example to make sure you understood everything we've covered so far.

After installing Docker on your computer, you can test the installation by running the docker run hello-world command.

jglisovic@jglisovic:~$ sudo docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/get-started/

jglisovic@jglisovic:~$


Here, if you read the whole output, you will see all the steps Docker took in order to execute the command you entered. Those steps can help you understand the Docker architecture scheme from above.

Our next step is to define the structure of the files and directories in the main directory for this Docker project. You will need the following files: a Dockerfile, which will specify our application-specific environment for deployment; a Python file called app.py, which will contain all those lines of code doing amazing data science stuff; and a bash.sh file, which will be used to automate execution. (Writing a bash script is just an option for more convenient execution; you can do it another way if you prefer.) You can see a possible structure below.
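
One possible layout, consistent with the Dockerfile and bash script shown later (the docker_project and deployment directory names are assumptions):

docker_project/
├── deployment/
│   └── bash.sh          # builds the image and (re)deploys the container
└── main_app/
    ├── Dockerfile       # environment specification for the app
    └── app/
        └── app.py       # the data science code we want to deploy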

Let's go step by step now. Firstly, you will need to write your app.py file. For the example's sake, I've made a file that takes a .csv file as input, does some processing and data aggregations, and finally exports the aggregated data into two new .csv files. You can see the app.py code below.

import pandas as pd

# import data
df = pd.read_csv("/data/segmentation_data.csv")

df = df.rename(str.lower, axis='columns')

df['cluster'] = df['cluster'].astype('str')
df['observation_date'] = pd.to_datetime(df['observation_date'])
df['birth_dt'] = pd.to_datetime(df['birth_dt'])

# stats by cluster_label, cluster
df_agg = df[['cluster_label', 'cluster', 'parameter1', 'parameter2', 'parameter3', 'parameter4', 'parameter5', 'parameter6']].groupby(
    by=['cluster_label', 'cluster'], as_index=False).agg(
    ['count', 'mean', 'std', 'min', 'median', 'max']).sort_values(
    by=['cluster_label', 'cluster'], ascending=False)
df_agg.columns = ['_'.join(x) for x in df_agg.columns.ravel()]
df_agg.reset_index(inplace=True)

# stats by cluster_label, cluster and parameter7
df_agg_p7 = df[[
    'cluster_label', 'cluster', 'parameter7', 'parameter1', 'parameter2', 'parameter3', 'parameter4', 'parameter5', 'parameter6'
    ]].groupby(
    by=['cluster_label', 'cluster', 'parameter7'], as_index=False).agg(
    ['count', 'mean', 'std', 'min', 'median', 'max']).sort_values(
    by=['cluster_label', 'cluster', 'parameter7'], ascending=False)
df_agg_p7.columns = ['_'.join(x) for x in df_agg_p7.columns.ravel()]
df_agg_p7.reset_index(inplace=True)

# export aggregated data to csv file
df_agg.to_csv("/data/results_table1.csv", index=False, header=True)
df_agg_p7.to_csv("/data/results_table2.csv", index=False, header=True)


In the next step you will need to make a Dockerfile. As already mentioned, the Dockerfile specifies the environment for running app.py, so that it executes smoothly on any machine where your app.py needs to be deployed. Below is an example Dockerfile.

FROM debian:8

MAINTAINER Jasmina Glisovic <jasmina.glisovic@thingsolver.com>

ENV LANG=C.UTF-8 LC_ALL=C.UTF-8

RUN apt-get update --fix-missing && apt-get install -y wget bzip2 ca-certificates \
    libglib2.0-0 libxext6 libsm6 libxrender1 \
    git mercurial subversion

RUN echo 'export PATH=/opt/conda/bin:$PATH' > /etc/profile.d/conda.sh && \
    wget --quiet https://repo.continuum.io/miniconda/Miniconda3-4.3.27-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh

RUN apt-get install -y curl grep sed dpkg && \
    TINI_VERSION=`curl https://github.com/krallin/tini/releases/latest | grep -o "/v.*\"" | sed 's:^..\(.*\).$:\1:'` && \
    curl -L "https://github.com/krallin/tini/releases/download/v${TINI_VERSION}/tini_${TINI_VERSION}.deb" > tini.deb && \
    dpkg -i tini.deb && \
    rm tini.deb && \
    apt-get clean

RUN apt-get update && apt-get install gcc -y
RUN apt-get install g++ -y
RUN apt-get install apt-transport-https debconf-utils build-essential apt-utils -y
ENV PATH /opt/conda/bin:$PATH
RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

RUN /opt/conda/bin/pip install pandas==0.23.4
RUN /opt/conda/bin/pip install SQLAlchemy==1.2.11
RUN /opt/conda/bin/pip install scikit-learn
RUN /opt/conda/bin/pip install xlrd
RUN /opt/conda/bin/pip install flask
RUN /opt/conda/bin/pip install dash==0.35.1
RUN /opt/conda/bin/pip install dash-html-components==0.13.4
RUN /opt/conda/bin/pip install dash-core-components==0.42.1
RUN /opt/conda/bin/pip install dash-table==3.1.11
RUN /opt/conda/bin/pip install psycopg2==2.7.6.1

ADD app/. /app

WORKDIR /app

CMD [ "python", "./app.py" ]


The Dockerfile starts with the FROM instruction, which specifies the base image to be used; in our case that is the image called debian (version 8). With the MAINTAINER instruction you specify who made the Dockerfile. The ENV instruction sets environment variables that will be available in the container at runtime. ENV is, in our case, followed by a series of RUN instructions in which we install all the packages and other dependencies needed for the smooth execution of the app.py file. The ADD instruction specifies a source and a destination for files; it is used to initialize the container file system with application-specific files. WORKDIR sets the working directory, i.e. the directory we would end up in if we opened a bash shell and tried to run commands interactively inside the container; in our case that is the app directory where our app.py file is located. Finally, CMD [ “python”, “./app.py” ] tells Docker to run the app.py file with Python when the container starts.
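
If you want to try this out before automating anything, the manual equivalent is just two commands; a minimal sketch, assuming you run it from the directory that contains main_app and that ~/data holds segmentation_data.csv (both paths are placeholders):

$ docker build -t docker_course main_app                            # build the image from the Dockerfile
$ docker run --name docker_course -v ~/data:/data docker_course     # mount the data folder and run app.py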

Our next step is to write a bash script that builds the image specified by the Dockerfile and runs the container. Below you can see the bash.sh file.

#!/usr/bin/env bash

# RESOURCE_PATH="~/PycharmProjects/docker resources"

# Build the image from the Dockerfile in ../main_app
echo "Build docker" &&
docker build -t docker_course ../main_app &&

# If a container named docker_course already exists, stop it...
echo "Find container if exist and kill" &&
echo "Container ID" &&
echo $(docker ps -aqf "name=docker_course") &&
echo "STOP container" &&
docker stop $(docker ps -aqf "name=docker_course") ||

# ...and remove it (the || lets the script continue even if there was nothing to stop)
echo "Remove container" &&
docker rm $(docker ps -aqf "name=docker_course") ||

# Deploy a fresh container; the first script argument is mounted as /data inside it
echo "Deploy new container" &&
docker run --name docker_course \
    -v "$1":/data docker_course &&

echo "App and container deployed"


First, we tell Docker to build the docker_course image. After that, we check whether there is already a container named docker_course on the system and, if there is, we stop and remove it. Finally, we run the docker_course image and deploy a new docker_course container, mounting the directory passed as the first argument of the script to /data inside the container.
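
Running the script then only requires passing the path of the directory that contains the input .csv file; it will be mounted as /data inside the container and the result files will appear there as well (the path below is hypothetical):

$ bash bash.sh ~/data    # ~/data must contain segmentation_data.csv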

I hope you found this post helpful for your task and that you are eager to keep exploring other possibilities of Docker. What you have read so far is only the beginning, so keep on learning and things solving! 🙂