Introducing and using Docker for Data Scientists

“But it works on my machine!”
This is an age-old meme in the tech community, especially for Data Scientists who want to ship their amazing machine learning model, only to find out that the receiving machine behaves completely differently. Sound familiar?
However…
There is an answer to this problem: containers, and containerisation tools such as Docker.
In this post, we’ll go over the basics of Docker and how to build and run containers with it. Using Docker containers has become industry standard and common practice when shipping data products. As a Data Scientist, Docker is one of the most valuable tools you can add to your arsenal.
Docker is a service that helps to create, run and deploy code and applications in containers.
Now you may be wondering, what is a container?
Conceptually, a container is very similar to a virtual machine (VM). It is a small, isolated environment where everything is ‘packaged up’ and can be run on any machine. The main selling point of containers is their portability, allowing your application or model to run seamlessly on any server, local machine, or cloud platform such as AWS.
The main difference between containers and VMs is how they use the host machine’s resources. Containers are much more lightweight because they share the host machine’s operating system kernel, whereas each VM bundles a full operating system of its own. I won’t delve into the technical details here, but if you want to understand the distinction a little better, I’ve linked a good article explaining their differences here.
Docker is the tool we use to easily create, manage and run these containers. This is one of the main reasons Docker has become so popular: it allows developers to easily ship applications and models that will run anywhere.
There are three things we need to run a container using Docker:
- Dockerfile: A text file that contains the instructions for building a Docker image.
- Docker image: A blueprint or template from which Docker containers are created.
- Docker container: An isolated environment that provides everything an application or machine learning model needs to run, including dependencies and OS versions.
There are also some important points to keep in mind:
- Docker daemon: The background process (daemon) that handles incoming requests to Docker.
- Docker client: The command-line interface that lets the user communicate with Docker via its daemon.
- Docker Hub: Similar to GitHub, a registry where developers can store and share their Docker images.
Homebrew
The first thing you need to install is Homebrew (link here). It calls itself ‘the missing package manager for MacOS’ and is very useful for anyone who develops on a Mac.
To install Homebrew, just run the command provided on their website:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Verify Homebrew is installed by running the brew help command.
Docker
Now that Homebrew is installed, you can install Docker by running brew install docker. Verify that Docker is installed by running which docker; the output should not raise any errors and should look like this:
/opt/homebrew/bin/docker
Colima
The last step is to install Colima. Simply run brew install colima and confirm the installation with which colima. Again, the output should look like this:
/opt/homebrew/bin/colima
Now you might be wondering, what on earth is Colima?
Colima is a piece of software that enables container runtimes on MacOS. In effect, Colima provides the environment that containers need in order to run on our system. To do this, it runs a Linux virtual machine with a daemon that the Docker client can connect to using the client-server model.
Alternatively, you can install Docker Desktop instead of Colima. However, I prefer Colima for several reasons: it’s free, it’s very lightweight, and I like working in the terminal!
Check out this blog post here for more information about Colima
Workflow
Below is an example of how Data Scientists and Machine Learning Engineers can deploy their model using Docker:
The first step is, of course, to build their amazing model. Next, you need to capture everything required to run that model, such as the Python version and the package dependencies. The final step is to use that requirements file inside a Dockerfile.
If this seems a little confusing to you at this point don’t worry, we’ll go over this step by step!
Basic Model
Let’s start by building a basic model. The code snippet below shows a simple implementation of a Random Forest classifier trained on the well-known iris dataset:
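A minimal sketch of such a script, assuming we use scikit-learn’s bundled copy of the iris dataset rather than a downloaded CSV, could look like this:

# basic_rf_model.py (a minimal sketch; the train/test split parameters are assumptions)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def main():
    # Load the iris dataset (here from scikit-learn's bundled copy)
    X, y = load_iris(return_X_y=True)

    # Hold out a test set so we can report an accuracy score
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    # Fit a simple Random Forest classifier
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)

    # Evaluate on the held-out data and print the accuracy
    print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test))}")

if __name__ == "__main__":
    main()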
Dataset from Kaggle with a CC0 licence.
This file is called basic_rf_model.py for reference.
Create the Requirements File
Now that we have our model ready, we need to create a requirements.txt file listing all the dependencies needed to run it. In this simple example, we luckily only rely on the scikit-learn package. Therefore, our requirements.txt will look like this:
scikit-learn==1.2.2
You can check which version you are running on your machine with the pip show scikit-learn command.
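Alternatively, you can print the version from Python itself:
python -c "import sklearn; print(sklearn.__version__)"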
Create a Dockerfile
Now we can create our Dockerfile!
In the same directory as requirements.txt and basic_rf_model.py, create a file named Dockerfile. Inside the Dockerfile we will have the following:
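FROM python:3.9
MAINTAINER egor@some.email.com
WORKDIR /src
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "basic_rf_model.py"]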
Let’s go through it line by line to see what it means:
- FROM python:3.9: This is the base image of our image.
- MAINTAINER egor@some.email.com: This indicates who maintains the image.
- WORKDIR /src: Sets the working directory of the image to src.
- COPY . .: Copies the current local files into the image’s working directory.
- RUN pip install -r requirements.txt: Installs the requirements from the requirements.txt file into the Docker environment.
- CMD ["python", "basic_rf_model.py"]: Tells the container to run the command python basic_rf_model.py, which runs our model.
Start Colima & Docker
The next step is to set up the Docker environment. First, we need to start Colima:
colima start
After Colima starts, make sure the Docker commands are running:
docker ps
It should produce something like this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
This is great and means that Colima and Docker are working as expected!
Note: the docker ps command lists all currently running containers.
Build the Image
Now it’s time to build our first Docker image from the Dockerfile we created above:
docker build . -t docker_medium_example
The -t flag specifies the name (tag) of the image, and the . tells Docker to build from the Dockerfile in the current directory.
If we now run docker images, we should see something like this:
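REPOSITORY              TAG       IMAGE ID       CREATED          SIZE
docker_medium_example   latest    bb59f770eb07   1 minute ago     1.25GB
(The CREATED and SIZE values here are only illustrative; the IMAGE ID will match whatever your build produced.)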
Great, the image has been created!
Run Container
After creating the image, we can run it as a container using the IMAGE ID listed above:
docker run bb59f770eb07
Output:
Accuracy: 0.9736842105263158
This is because all the container did was run the basic_rf_model.py script!
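As a side note, since we tagged the image when building it, we could equally run the container by the image name rather than by its ID:
docker run docker_medium_example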
More information
This tutorial has only scratched the surface of what Docker can do and offer. There are many more features and commands to learn in order to understand Docker well. A very detailed tutorial is provided on the Docker website, which you can find here.
One cool feature is that you can run the container in interactive mode and enter its shell. For example, if we run:
docker run -it bb59f770eb07 /bin/bash
You will enter the Docker container’s shell, and it should look something like this:
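root@<container-id>:/src# ls
Dockerfile  basic_rf_model.py  requirements.txt
(The prompt will show your container’s own ID, and /src is the working directory we set in the Dockerfile.)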
Here we used the ls command to list all the files in the Docker working directory.
Docker is a fantastic tool for ensuring that Data Scientists’ models can run anywhere and at any time without problems. It does this by creating small, isolated environments that contain everything the model needs to run; these are the containers. Containers are easy to use and lightweight, which is why they have become so popular in industry. In this article, we went through a basic example of how to package a model into a container using Docker. The process was simple and straightforward, so it is something Data Scientists can pick up quickly.
All the code used in this article is available on my GitHub here:
(All emojis created by OpenMoji – an open-source emoji and icon project. License: CC BY-SA 4.0)