1. Overview
Docker is a framework designed to ease the creation, deployment, and general management of containers. Since a container can run any kind of application, we might also want to use one for a database management system (DBMS).
In this tutorial, we’ll review how to work with Docker to handle databases.
In the first part, we’ll cover the installation of a database management system on the local machine. Then, we’ll discover how data persistence works across containers. To conclude, we’ll discuss the reliability of implementing databases in Docker production environments.
2. Running a Docker Image Locally
To begin with, we need a DBMS within a container.
2.1. Starting With a Standard Docker Image
First, we install Docker Desktop. Then, we find an existing image of the chosen DBMS on Docker Hub. Once we locate it, we copy the docker pull command shown at the top right of the page.
In this case, we work with PostgreSQL:
$ docker pull postgres
When the download is complete, the docker run command creates a running database within a Docker container. For PostgreSQL, the POSTGRES_PASSWORD environment variable can be specified with the -e option:
$ docker run -e POSTGRES_PASSWORD=password postgres
Next, we test the database container connection.
2.2. Connecting a Java Project to the Database
Let’s quickly verify the connection to the DBMS from a local Java project using a JDBC data source.
The connection string uses the default PostgreSQL port 5432 on localhost:
jdbc:postgresql://localhost:5432/postgres?user=postgres&password=password
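For instance, a minimal connection check could look like this (a sketch assuming the PostgreSQL JDBC driver is on the classpath; the class name is ours):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionTest {

    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/postgres?user=postgres&password=password";
        // DriverManager picks up the PostgreSQL driver from the classpath
        try (Connection connection = DriverManager.getConnection(url)) {
            System.out.println("Connected: " + connection.isValid(1));
        }
    }
}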
An error should inform us that the port isn’t open. Indeed, the database listens for connections inside the container network, while the Java project runs outside of it.
To fix this issue, we map the container port to a port on the localhost. For that, we use the same port, so we map 5432 on the local machine to 5432 in the container:
$ docker run -p 5432:5432 -e POSTGRES_PASSWORD=password postgres
The connection should be working now, so we’re able to use the JDBC data source.
2.3. Running SQL Scripts
Now, let’s connect to the DBMS from a shell, for example, to run an initialization script.
First, we find the running container ID:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
65d9163eece2 postgres "docker-entrypoint.s…" 27 minutes ago Up 27 minutes 0.0.0.0:5432->5432/tcp optimistic_hellman
Then, the docker exec command with the interactive -it option runs a shell inside the container:
$ docker exec -it 65d9163eece2 bash
Finally, we can connect to the database instance with the command-line client and paste an SQL script:
root@65d9163eece2:/# psql -U postgres
postgres=# CREATE DATABASE TEST;
CREATE TABLE PERSON(
ID INTEGER PRIMARY KEY,
FIRST_NAME VARCHAR(1000),
LAST_NAME VARCHAR(1000)
);
...
For example, if we have a large dump file to load, we usually avoid copy-pasting. Instead, we can run the import directly from the host with the docker exec command, using the -i option to attach the standard input:
$ docker exec -i 65d9163eece2 psql -U postgres < dump.sql
This way, we also facilitate automation.
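As a sketch, a small setup script could chain such imports (the container ID and file names here are placeholders):

#!/bin/sh
# create the schema, then load the data, reading each file from the host
docker exec -i 65d9163eece2 psql -U postgres < schema.sql
docker exec -i 65d9163eece2 psql -U postgres < data.sql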
3. Persist Data With a Docker Volume
Of course, databases usually need to preserve data across restarts.
3.1. Why Do We Need Volumes?
The basic setup works as long as we use the same container, stopping and starting it with docker container stop and start each time we need to reboot. Indeed, Docker persists the data inside an anonymous volume on the host by default.
However, if we use docker run again, a new, empty container with its own anonymous volume is created, and the old data is no longer available.
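We can verify this with the container from earlier, whose data survives a stop/start cycle:

$ docker stop 65d9163eece2
$ docker start 65d9163eece2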
3.2. Docker Volumes Setup
The first task is to inspect the container and see which volume is used by the database:
$ docker inspect -f "{{ .Mounts }}" 65d9163eece2
[{volume f1033d3 /var/lib/docker/volumes/f1033d3/_data /var/lib/postgresql/data local true }]
In this case, the anonymous volume f1033d3 maps the container directory /var/lib/postgresql/data to the directory /var/lib/docker/volumes/f1033d3/_data, automatically created on the host filesystem.
By using the -v option of docker run, we can modify this mapping:
$ docker run -v C:\docker-db-volume:/var/lib/postgresql/data -e POSTGRES_PASSWORD=password postgres
Now, we can see the database files created in the C:\docker-db-volume directory.
As a result, each time we use the docker run command, the data should persist.
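Alternatively, we can let Docker manage the storage location with a named volume, which is more portable across operating systems (db-data is an arbitrary name of our choosing):

$ docker run -v db-data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=password postgres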
Also, we may want to share the configuration between team members or across different environments. Here, a Docker Compose file helps: since Compose creates new containers each time, volumes become mandatory.
4. Working With Docker in Production
Docker Compose is great for sharing configuration and managing containers as stateless services. If a service fails or can’t handle the workload, we can configure Docker Compose to create new containers automatically. This is very useful for building a production cluster for implementations like REST back-ends, which are stateless by design.
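For instance, a hypothetical stateless service could declare replicas and a restart policy, which Docker Swarm honors when deployed with docker stack deploy (my-rest-backend is a placeholder image name):

version: '3'
services:
  rest-backend:
    image: 'my-rest-backend'
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure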
However, databases are stateful, and their management can be more complex. So, let’s review different setups.
4.1. Single Instance Database
Let’s suppose we’re building a non-critical environment, for testing or in production, that tolerates periods of downtime during deployments, backups, or failure.
In this case, we don’t need a high-availability cluster, and we can simply leverage Docker Compose for a single-instance database with a basic setup:
- a simple volume for data storage, since the containers run on the same machine
- a limit of one container at a time, using the global deploy mode
Let’s see a minimal working example:
$ cat docker-compose.yaml
version: '3'
services:
  database:
    image: 'postgres'
    deploy:
      mode: global
    environment:
      - POSTGRES_PASSWORD=password
    ports:
      - "5432:5432"
    volumes:
      - "C:/docker-db-volume:/var/lib/postgresql/data"
Using this configuration, Docker Compose creates only one container at a time and reuses the data files from the C:\docker-db-volume directory.
However, it’s even more important in this configuration to make regular backups. In case of a configuration error, this directory could be erased or corrupted by the container.
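For example, a simple dump taken from the host can serve as a baseline backup (using the container ID from earlier):

$ docker exec 65d9163eece2 pg_dump -U postgres postgres > backup.sql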
4.2. Replicated Databases
Now, let’s assume that the production environment is critical.
In this case, orchestration tools like Docker Swarm and Kubernetes are beneficial for stateless containers: They offer vertical and horizontal clustering, with load-balancing, fail-over, and auto-scaling capabilities.
Unfortunately, since database containers are stateful and these solutions don’t provide a volume replication mechanism, they won’t work in this case.
On the other hand, it can be dangerous to build homemade configurations because it can lead to severe data loss:
- using shared storage like NFS or NAS for volumes doesn’t prevent data loss when the database is restarted in another instance
- on master-slave clusters, it’s a common error to let a Docker orchestration unit elect more than one master node, which leads to data corruption
So, there are several different options:
- implement a database-specific or hardware replication mechanism
- subscribe to Platform-as-a-Service solutions like OpenShift, Amazon AWS, or Azure
- use a container-specific replication mechanism like KubeDB or Portworx
While not optimal if we want to use Docker directly and without extensions, it’s always better to protect the setup than to force unsuitable tools on the problem.
5. Whether to Use a Docker DBMS
After getting to know how we might leverage Docker alone for a DBMS setup, let’s weigh the benefits and drawbacks of doing so.
5.1. Benefits of Docker DBMS
To begin with, using a container provides an isolated minimal environment that runs the DBMS. This way, we can reproduce the same context on many machines and easily upgrade, migrate, and generally handle the setup.
In fact, images for most major DBMS environments are available for download on registries like Docker Hub:
$ docker pull postgres
Since the data context is anonymous and ephemeral, we don’t need to worry about transfers beyond the basic configuration and startup options.
Importantly, containers add little overhead to database performance, especially when it comes to input and output, since volumes write directly to the host filesystem.
5.2. Drawbacks of Docker DBMS
On the negative side, not all DBMS vendors support every containerized environment. This is mainly due to the fast-paced development of Docker and container systems in general, and it potentially restricts container-based database solutions in production.
Part of the reason is also that split-brain scenarios can be more common with containers.
Further, we might be tempted to create several DBMS instances of the same kind that work with the same files. This can be an issue since Docker can’t resolve race conditions beyond what the database system itself can do.
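For instance, starting two containers over the same data directory is an anti-pattern that may corrupt the files, since each instance assumes exclusive access:

$ docker run -d -p 5432:5432 -v C:\docker-db-volume:/var/lib/postgresql/data -e POSTGRES_PASSWORD=password postgres
$ docker run -d -p 5433:5432 -v C:\docker-db-volume:/var/lib/postgresql/data -e POSTGRES_PASSWORD=password postgres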
On a related note, we shouldn’t kill a database container as we would a stateless service container. This can result in serious data loss.
Finally, although container storage drivers are now mature, they still represent an abstraction that can be another point of failure.
6. Conclusion
In this article, we’ve reviewed how to run a DBMS in Docker, along with the basic configuration suitable for development, testing, and non-critical production.
Finally, we concluded that Docker has drawbacks when used in high-availability environments. Therefore, it should be avoided or coupled with solutions specialized for database clusters.