Docker Internals: A Comprehensive Guide. Containerization Mechanisms + Examples, Experiments, and Implementation
A deep-dive technical guide that demystifies Docker by exploring its foundational Linux mechanisms — chroot, namespaces, cgroups, and OverlayFS — with hands-on code examples in C++, Python, and Go and real experiments, showing that a container is just a regular process with preconfigured system attributes.
Docker is not magic — it is the skillful application of Linux mechanisms. In this guide, we will go from general to specific, from simple to complex, examining every fundamental technology that makes containers possible. Be prepared that some sections may require re-reading for full comprehension.
Foundations: Linux and System Calls
The processor executes a program's machine code instruction by instruction; the Linux kernel decides which process runs and when. But programs also need access to system resources — files, network, memory. To prevent processes from destabilizing the system, the kernel provides an API: system calls. These are the controlled gateway through which a process requests resources from the operating system.
A program is code stored on disk. A process is a running instance of that program, with its own memory space and system resources. The kernel manages all of this. And Docker? Docker is just a process — with some preconfigured system attributes.
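To make the system-call boundary concrete, here is a small Python sketch (my illustration, not from the original text; it assumes a Linux host with glibc) that invokes the getpid system call through the C library and checks it against Python's own wrapper:

```python
import ctypes
import os

# CDLL(None) returns a handle to the already-loaded C library;
# its getpid() is a thin wrapper around the getpid system call.
libc = ctypes.CDLL(None)

pid_via_libc = libc.getpid()   # crosses into the kernel via the syscall gate
pid_via_python = os.getpid()   # Python's wrapper around the same system call

assert pid_via_libc == pid_via_python
print(f"Both paths report PID {pid_via_python}")
```

Both calls end up at the same kernel entry point; the language runtime is just a convenience layer on top of the syscall interface.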
Chroot
The chroot system call makes a process believe that its root directory is not / but some other directory. This is the simplest form of filesystem isolation.
Let's try it in practice. First, create a minimal environment:
mkdir -p /tmp/myroot/bin /tmp/myroot/lib /tmp/myroot/lib64
cp /bin/bash /tmp/myroot/bin/
# Copy the shared libraries bash needs, preserving their directory layout
ldd /bin/bash | grep -o '/[^ ]*' | xargs -I{} cp --parents {} /tmp/myroot/
chroot /tmp/myroot /bin/bash

Now you are inside a chroot jail. The process sees /tmp/myroot as its root /. It cannot access files outside this directory. However, chroot alone is not enough for real isolation — a process with root privileges can escape a chroot. That's where namespaces come in.
Namespaces
Namespaces are an intermediary layer between a process and the resources it wants to access. They act as access points (descriptors) through which a process reaches operating system resources. Every process has its own set of namespace references, stored in the kernel's nsproxy structure.
Linux supports several types of namespaces:
- UTS Namespace — isolates the hostname. A container can have its own hostname different from the host.
- Mount Namespace — isolates mount points. The container sees its own filesystem tree.
- PID Namespace — isolates the process ID hierarchy. Inside the container, the first process has PID 1.
- Network Namespace — isolates network interfaces, routing tables, firewall rules.
- IPC Namespace — isolates inter-process communication mechanisms.
- Cgroup Namespace — isolates the view of control groups.
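You can inspect a process's namespace references directly under /proc. Each entry in /proc/&lt;pid&gt;/ns is a symlink whose target encodes the namespace type and an inode number; two processes share a namespace exactly when those inode numbers match. A minimal Python sketch (assuming a Linux host):

```python
import os

def namespace_ids(pid="self"):
    """Return {namespace_name: symlink_target} for a process."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

ids = namespace_ids()
for ns_name, target in ids.items():
    # e.g. "uts" -> "uts:[4026531838]"; the number is the namespace inode
    print(f"{ns_name:20} {target}")
```

Comparing the output for your shell and for a process inside a container shows different inode numbers for every unshared namespace.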
A child process inherits the namespaces of its parent. But using the unshare system call (or the clone call with appropriate flags), a process can create its own copy of a namespace and detach from the parent's version.
Here's a C++ example demonstrating namespace creation:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
static char child_stack[1048576];
static int child_fn(void *arg) {
    (void)arg; // clone() passes through its last argument; unused here
    // This runs in new UTS and PID namespaces
    sethostname("container", 9);
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("Child hostname: %s\n", hostname);
    printf("Child PID: %d\n", getpid()); // Will print 1
    return 0;
}
int main() {
pid_t child_pid = clone(child_fn,
child_stack + sizeof(child_stack),
CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD,
NULL);
if (child_pid == -1) {
perror("clone");
exit(1);
}
waitpid(child_pid, NULL, 0);
return 0;
}

When compiled and run as root, this program creates a new process that lives in its own UTS and PID namespaces. It sees itself as PID 1 and can set its own hostname without affecting the host.
Cgroups (Control Groups)
While namespaces isolate what a process can see, cgroups limit how much of physical resources it can use: CPU time, memory, number of processes, and more.
Cgroups are managed through a special filesystem mounted at /sys/fs/cgroup. To create a control group and set limits (the commands below assume the cgroup v2 unified hierarchy):
# Create a new cgroup
mkdir /sys/fs/cgroup/mygroup
# Enable the memory and pids controllers for child groups, if not already enabled
echo "+memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
# Set memory limit to 50 MB
echo 52428800 > /sys/fs/cgroup/mygroup/memory.max
# Set max number of processes to 20
echo 20 > /sys/fs/cgroup/mygroup/pids.max
# Add a process to this cgroup
echo $$ > /sys/fs/cgroup/mygroup/cgroup.procs

Now let's see what happens when a process exceeds the memory limit. Here's a Python script that deliberately allocates more than 50 MB:
data = []
while True:
    # Allocate 1 MB at a time; each chunk stays referenced, so memory keeps growing
    data.append(b'A' * 1024 * 1024)
    print(f"Allocated {len(data)} MB")

When this script runs inside the cgroup with a 50 MB limit, the OOM (Out of Memory) Killer terminates the process with SIGKILL once it tries to exceed the limit — the script simply dies with "Killed" rather than raising a Python MemoryError, so there is nothing to catch. This is exactly how Docker enforces the --memory flag.
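You can check which cgroup any process belongs to by reading /proc/&lt;pid&gt;/cgroup. On a cgroup v2 system it contains a single line of the form 0::&lt;path&gt;; for a Docker container the path typically ends in a docker-&lt;id&gt;.scope or similar. A small Python sketch (assuming a Linux host):

```python
def cgroup_of(pid="self"):
    """Parse /proc/<pid>/cgroup into (hierarchy_id, controllers, path) tuples."""
    entries = []
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            # Each line has the form "hierarchy-ID:controller-list:cgroup-path"
            hier_id, controllers, path = line.rstrip("\n").split(":", 2)
            entries.append((hier_id, controllers, path))
    return entries

for hier_id, controllers, path in cgroup_of():
    print(f"hierarchy={hier_id} controllers={controllers!r} path={path}")
```

Running this inside the mygroup cgroup from above would show the /mygroup path, confirming the shell's echo into cgroup.procs took effect.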
OverlayFS
OverlayFS is a union filesystem that creates a "layered cake" from multiple directory layers. The lower layer is read-only, the upper layer is read-write, and together they are presented as a single merged view.
mkdir -p /tmp/overlay/{lower,upper,work,merged}
# Create files in the lower layer
echo "base file" > /tmp/overlay/lower/base.txt
# Mount the overlay
mount -t overlay overlay \
-o lowerdir=/tmp/overlay/lower,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
/tmp/overlay/merged
# The merged directory shows both layers
ls /tmp/overlay/merged/
# Shows: base.txt
# Modifications go to the upper layer only
echo "modified" > /tmp/overlay/merged/base.txt
cat /tmp/overlay/lower/base.txt # Still: "base file"
cat /tmp/overlay/upper/base.txt # Shows: "modified"

This is fundamental to how Docker images work. Each layer of a Docker image is a read-only lower directory. When a container runs, Docker adds a writable upper layer on top. Changes only affect the upper layer, leaving the base image untouched. Multiple containers can share the same base image layers, saving disk space.
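The lookup rule OverlayFS applies (the upper layer shadows the lower one, and deletions are recorded in the upper layer as "whiteout" entries) can be modeled in a few lines of Python. This is a toy model of the semantics, not the kernel implementation:

```python
# Stand-in for the character device OverlayFS uses to mark a deleted file
WHITEOUT = object()

def merged_view(lower: dict, upper: dict) -> dict:
    """Compute the merged view of one lower and one upper layer."""
    view = dict(lower)                 # start from the read-only base
    for name, content in upper.items():
        if content is WHITEOUT:
            view.pop(name, None)       # file deleted in the upper layer
        else:
            view[name] = content       # upper layer shadows the lower one
    return view

lower = {"base.txt": "base file", "keep.txt": "untouched"}
upper = {"base.txt": "modified", "old.txt": WHITEOUT}
print(merged_view(lower, upper))
# → {'base.txt': 'modified', 'keep.txt': 'untouched'}
```

A multi-layer image is just this operation folded over the layer stack, lowest layer first.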
OCI Standards
The Open Container Initiative (OCI) defines two critical specifications:
- Runtime Specification — describes how to run a container from an unpacked filesystem bundle.
- Image Specification — describes the format of container images (layers, metadata, configuration).
The reference implementation of the OCI Runtime Spec is runc. You can use it directly:
# Create an OCI bundle
mkdir -p mycontainer/rootfs
# Export a Docker image filesystem into the bundle
docker export $(docker create alpine) | tar -C mycontainer/rootfs -xf -
# Generate a default OCI config
cd mycontainer
runc spec
# Run the container
runc run my-container-id

The config.json generated by runc spec contains all the container parameters: namespaces to create, cgroup limits, mount points, the process to execute, and more. This is the blueprint for a container.
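To get a feel for what runc spec produces, here is a deliberately stripped-down sketch of a config.json built in Python. It covers only a handful of the real spec's fields (field names follow the OCI Runtime Specification; actual runc output contains many more entries):

```python
import json

# A minimal subset of an OCI runtime config; real `runc spec` output
# also includes capabilities, mounts, rlimits, and much more.
config = {
    "ociVersion": "1.0.2",
    "process": {
        "args": ["sh"],   # the process to execute inside the container
        "cwd": "/",
    },
    "root": {"path": "rootfs", "readonly": True},
    "hostname": "runc",
    "linux": {
        # Which namespaces the runtime should create for the container process
        "namespaces": [{"type": t} for t in
                       ["pid", "network", "ipc", "uts", "mount", "cgroup"]],
    },
}

print(json.dumps(config, indent=2))
```

Every mechanism covered so far appears here: the namespaces list, the rootfs that becomes the container's filesystem view, and (in the full spec) the cgroup limits under linux.resources.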
Docker Architecture
Docker uses runc under the hood. When you run docker run, here's what happens at a high level:
- The Docker daemon pulls the image (if not cached) — a set of compressed filesystem layers.
- Layers are stored in /var/lib/docker/overlay2/ (for the overlay2 storage driver).
- Docker creates an OCI bundle: it assembles the layers into an OverlayFS mount and generates a config.json.
- Docker calls containerd, which calls runc to actually create the container process.
- The container process starts with its configured namespaces, cgroups, and filesystem view.
You can explore this yourself:
# Inspect image layers
docker inspect alpine | jq '.[0].RootFS'
# See the overlay mount for a running container
docker inspect <container_id> | jq '.[0].GraphDriver'
# Look at the actual files
ls /var/lib/docker/overlay2/

Each directory in /var/lib/docker/overlay2/ corresponds to an image layer or container layer, containing the diff (changed files), merged (combined view), work (internal use), and link (shortened identifiers) directories.
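You can also confirm how a container's overlay is assembled by finding its overlay entry in /proc/mounts and parsing the options string. Here is a small parser, with a sample options string standing in for real output (the layer identifiers in it are made up for illustration):

```python
def parse_overlay_opts(options: str) -> dict:
    """Split overlay mount options like 'lowerdir=a:b,upperdir=c,workdir=d'."""
    parsed = {}
    for opt in options.split(","):
        key, _, value = opt.partition("=")
        # lowerdir may stack several layers, colon-separated
        parsed[key] = value.split(":") if key == "lowerdir" else value
    return parsed

# Sample options in the shape the overlay2 driver produces
# (the layer IDs here are hypothetical)
sample = ("lowerdir=/var/lib/docker/overlay2/l/AAA:/var/lib/docker/overlay2/l/BBB,"
          "upperdir=/var/lib/docker/overlay2/abc123/diff,"
          "workdir=/var/lib/docker/overlay2/abc123/work")

opts = parse_overlay_opts(sample)
print(len(opts["lowerdir"]), "lower layers")   # → 2 lower layers
```

The number of lowerdir entries matches the number of image layers; the single upperdir is the container's private writable layer.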
Putting It All Together
A container is not something alien — it's simply a process. A process that has:
- Its own filesystem view (via chroot + OverlayFS + mount namespace)
- Its own process tree (PID namespace)
- Its own hostname (UTS namespace)
- Its own network stack (network namespace)
- Resource limits (cgroups)
Docker orchestrates all of this by leveraging standard Linux kernel features. There is no container "VM" — no hypervisor, no emulation. Just a process with some clever configuration.
Understanding these internals makes debugging containers far easier. When something goes wrong, you know exactly which layer of abstraction to investigate: is it a namespace issue, a cgroup limit, a filesystem overlay problem, or something else entirely?
The goal of this guide was to demystify Docker and show that containerization is a natural extension of the Linux kernel's capabilities. Docker simply packages these mechanisms into a convenient, user-friendly tool.