Introduction
I’m sure all of you are well aware of Docker and how it allows us to design and implement applications based on microservices rather than a giant monolithic monster. Docker as a company dates from 2013, but the technology behind it is actually a native Linux technology implemented a few years prior, around 2008. Linux Containers, or LXC, is an OS-level virtualization technology that allows users to run multiple isolated Linux systems on a control host using a single Linux kernel. Before jumping into the quick tutorial, let’s go through a little bit of theoretical background so you have some context on containers and the technologies they use, among other things.
Quoting the official LXC documentation:
“LXC is a userspace interface for the Linux kernel containment features. Through a powerful API and simple tools, it lets Linux users easily create and manage system or application containers.”
The Linux kernel, which is shared between the host machine and the LXC, provides a functionality known as control groups, or cgroups, which allows the limitation and prioritization of resources without the need to start a Virtual Machine. (Just to be clear and to point out the difference between LXCs and VMs: LXCs are significantly lighter weight than VMs, mainly because the kernel is shared with the host machine, while a VM needs a separate kernel.) On the other hand, the Linux kernel also provides another isolation functionality known as namespaces, which allows complete isolation of an application’s view of the host machine, including process trees, networking, user IDs and mounted file systems.
Docker can be introduced, technically, as an extension of LXC’s capabilities. It is written in Go and exposes a high-level API that provides lightweight virtualization. Similarly, Docker uses cgroups, namespaces and, originally, LXC itself. Docker acts as a portable container engine, packaging the application and all its dependencies in a virtual container that can run on any Linux server.
The main difference between Docker and LXC is that the former is designed to isolate a single application in a container, which lets users isolate multiple applications on a server, whereas the latter is designed to isolate an entire OS in a container, which lets users run multiple isolated OSes on a server.
cgroups
The main purpose of cgroups is to allow the user to allocate resources - such as CPU time, system memory, network bandwidth or combinations of these - among user-defined groups of tasks (processes) running on a system. The user can configure, monitor and limit cgroups and the resources they are granted.
The way cgroups are organized is hierarchical, like processes, and their attributes are inherited by their child cgroups. However, there are some differences between the Linux process model and the cgroups model. As we already know, the Linux process model consists of a single tree of processes, in which the root is the init process executed by the kernel at boot time, which then starts other processes. However, many different hierarchies of cgroups can exist simultaneously on a system, so instead of a single tree, the cgroup model is one or more separate, unconnected trees of tasks.
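To make this a bit more concrete, the sketch below shows roughly how a cgroup can be created and used to cap the number of processes, simply by writing to the cgroup filesystem. This is only an illustration and is not part of the final program: it assumes a cgroup v1 hierarchy with the pids controller mounted at /sys/fs/cgroup/pids (paths differ under cgroup v2), the cgroup name mycontainer and the function name limitPids are made up, and it relies on the os, path/filepath and strconv packages.

func limitPids(max int) error {
	// Hypothetical cgroup directory; assumes a cgroup v1 pids controller.
	cg := "/sys/fs/cgroup/pids/mycontainer"
	if err := os.MkdirAll(cg, 0755); err != nil {
		return err
	}
	// Cap the number of processes the group may contain.
	if err := os.WriteFile(filepath.Join(cg, "pids.max"), []byte(strconv.Itoa(max)), 0700); err != nil {
		return err
	}
	// Add the current process to the group; its children inherit the membership.
	return os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700)
}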
namespaces
In Linux there are mainly 7 different namespaces, which provide isolation of an application’s view of the host machine, as introduced earlier. A short sketch right after this list shows how you can inspect the namespaces of a running process. The namespaces are:
- pid namespace: This namespace provides process isolation, i.e. it isolates the process ID number space. This means that processes in different PID namespaces can have the same PID. It allows containers to provide functionality such as suspending/resuming the set of processes in the container and migrating the container to a new host while the processes inside the container keep the same PIDs. PIDs in a new PID namespace start at 1, somewhat like a standalone system, and calls to fork(2), vfork(2), or clone(2) will produce processes with PIDs that are unique within the namespace.
- network namespace: This namespace provides isolation of network resources: network devices, IPv4 and IPv6 protocol stacks, IP routing tables, firewall rules, the /proc/net directory (which is a symbolic link to /proc/PID/net), the /sys/class/net directory, various files under /proc/sys/net, port numbers (sockets), and so on. A physical network device can live in exactly one network namespace. When a network namespace is freed, its physical network devices are moved back to the initial network namespace (not to the parent of the process).
- uts namespace: This namespace allows for the segregation of hostnames. It provides isolation of two system identifiers: the hostname and the NIS domain name. These identifiers are set using sethostname(2) and setdomainname(2), and can be retrieved using uname(2), gethostname(2), and getdomainname(2). Changes made to these identifiers are visible to all other processes in the same UTS namespace, but are not visible to processes in other UTS namespaces. As we already know, most communication to and from a host happens via an IP address and a port number; however, it is much easier to work with a name attached to the machine.
- user namespace: This namespace allows the system to restrict access to sensitive system files. It isolates security-related identifiers and attributes, in particular user IDs and group IDs, the root directory, keys, and capabilities. A process’s user and group IDs can be different inside and outside a user namespace.
- mount namespace: This namespace provides isolation of mount points, such that processes in different namespaces cannot view each other’s files. If you are familiar with the chroot command, it functions similarly. Basically, it isolates the list of mounts seen by the processes in each namespace instance, so the processes in each mount namespace instance see distinct single-directory hierarchies.
- ipc namespace: This namespace handles communication between processes by using shared memory areas, message queues, and semaphores. It isolates certain IPC resources, namely System V IPC objects and (since Linux 2.6.30) POSIX message queues. The common characteristic of these IPC mechanisms is that IPC objects are identified by mechanisms other than filesystem pathnames. Each IPC namespace has its own set of System V IPC identifiers and its own POSIX message queue filesystem. Objects created in an IPC namespace are visible to all other processes that are members of that namespace, but are not visible to processes in other IPC namespaces. For a more detailed explanation of this complex topic, please refer to the post at opensource.com by Marty Kalin.
- cgroups namespace: This namespace virtualizes a process’s view of the cgroup hierarchy. cgroups themselves were already introduced earlier, but to summarize, they are a mechanism for controlling system resources such as CPU time, system memory, I/O operations, etc.
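As a small illustration (not part of the container we will build), you can see which namespaces a process belongs to by reading the symbolic links under /proc/self/ns. The helper name printNamespaces below is made up; it only needs the fmt and os packages:

func printNamespaces() {
	// Each entry under /proc/self/ns is a symlink such as "pid:[4026531836]";
	// two processes share a namespace when their links point to the same target.
	entries, err := os.ReadDir("/proc/self/ns")
	if err != nil {
		panic(err)
	}
	for _, entry := range entries {
		target, err := os.Readlink("/proc/self/ns/" + entry.Name())
		if err != nil {
			continue
		}
		fmt.Println(entry.Name(), "->", target)
	}
}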
Write a container in Go
Alright, here comes the fun part. The previous introduction may sound a bit too theoretical, so let’s dive into the practical section and you will see that all of this is not that complex. These kinds of abstract topics require some time to digest, but having a working example helps immensely to internalize the concepts.
First of all, this whole tutorial will use Go. It is a statically typed, compiled, high-level programming language that produces a statically linked binary you can simply copy to your server and run (this is one of the reasons Docker is developed in Go - it is multi-platform).
Quick disclaimer: this tutorial is not intended to be a Golang tutorial, but I’m going to give a brief introduction so everyone can understand what we are doing here. Go programs are organized into packages, where a package is a collection of source files in the same directory that are compiled together. The primary package that is compiled is named main, and the basic Hello, world! example in Go would be something like:
package main
import (
"fmt"
)
func main() {
fmt.Println("Hello, world!")
}
This can be run by executing go run . in the directory where the file containing the previous code is located.
If you want to learn more about Go, or just feel like you are not fully ready to comprehend the code presented below, you can follow A tour of Go, which offers a guided tutorial with a mix of theoretical and practical resources to get to know what Go is capable of.
Since Go is a statically typed language, we cannot write code snippets that use, for example, a function that has not been defined. So please, don’t expect the following code snippets to work if you just copy and paste them into your code editor. At the end of the post, you will find the complete version of the code.
First, let’s go through the Go packages we will be using:
- fmt: Implements formatted I/O with functions analogous to C’s printf and scanf.
- os: Provides a platform-independent interface to operating system functionality.
- os/exec: Part of the os package (hence the os/ prefix). It runs external commands and is a wrapper around os.StartProcess that makes it easier to remap stdin and stdout, connect I/O with pipes, and make other adjustments.
- syscall: Contains an interface to the low-level operating system primitives.
So, our code should include an import section with those packages. Most code editors can be configured to write Go, and the import section is usually completed automatically when you use a function from those packages.
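For reference, the import section we will end up with looks like this:

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)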
Secondly, you should know that while a (compiled) Go program - or any other process - is running, there is a special file at /proc/self/exe. It is a symbolic link that points to the executable file that started the currently running process, so it can be used and referenced to make a program call itself. Does it sound like a container?
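If you want to see this for yourself, a throwaway snippet like the following (not needed for the container itself) resolves the link and prints where it points; it only uses os.Readlink and fmt:

// Resolve the /proc/self/exe symlink for the current process.
path, err := os.Readlink("/proc/self/exe")
if err != nil {
	panic(err)
}
fmt.Println("running from:", path)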
With the exec package we can execute /proc/self/exe. exec.Command() returns a Cmd struct that executes the named program with the given arguments, as seen in the parent function below:
func parent() {
cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
cmd.Stdin = os.Stdin
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
fmt.Println("ERROR", err)
os.Exit(1)
}
}
The key concept here is that the program re-executes itself with different arguments, creating a sort of isolation. When executed with run, it starts a new instance of itself with child, effectively creating two levels of processes. This is a basic form of isolation, a core feature of containerization. However, this code does not implement other aspects of containers like filesystem isolation, resource limiting, or complete process isolation, as it is intended to emulate a simplified version of a container.
package main
import (
"fmt"
"os"
"os/exec"
"syscall"
)
func main() {
switch os.Args[1] {
case "run":
parent()
case "child":
child()
default:
panic("IDK what to do!")
}
}
func parent() {
cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
cmd.Stdin = os.Stdin
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
fmt.Println("[ PARENT ERROR ]:", err)
os.Exit(1)
}
}
func child() {
cmd := exec.Command(os.Args[2], os.Args[3:]...)
cmd.Stdin = os.Stdin
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
fmt.Println("[ CHILD ERROR ]:", err)
os.Exit(1)
}
}
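If you want to try this intermediate version, build it and run the resulting binary with something like ./main run echo hello (the binary name depends on how you build it): the parent re-executes the binary with the child argument, and the child in turn runs echo hello.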
As we said, this can be considered a very naive container, since it lacks the implementation of key concepts that make containers what they are. Therefore, let’s add some more functionality to our simple “isolator”.
Adding namespaces
As we already know, Linux namespaces are a feature of the Linux kernel that provides a form of lightweight process isolation. In order to use them from our program, we can set cmd.SysProcAttr:
cmd.SysProcAttr = &syscall.SysProcAttr{
Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS | syscall.CLONE_NEWUSER,
}
This allows us to run our program inside new UTS, PID, mount and user namespaces. You can add it right below the first line of the parent function, and you will find the complete version of the code at the end of the post.
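A quick way to convince yourself that the namespaces are in place is to change the hostname from inside the child and check that the host’s hostname is untouched. This is just an optional experiment you could drop at the top of the child() function, not something we will keep in the final code, and depending on how the user namespace ends up configured it may require additional uid/gid mappings:

// Set the hostname inside the new UTS namespace; the host keeps its own.
if err := syscall.Sethostname([]byte("container")); err != nil {
	fmt.Println("[ CHILD ERROR ]:", err)
	os.Exit(1)
}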
Custom root FS
Right now, our container process is in isolated UTS, PID, mount and user namespaces, but the filesystem is still the same as the host’s, since we are inheriting it from the main (parent) namespace. Therefore, we need to choose a root FS. To do this, we first need a very simple helper function that ensures each step completes without issues; we will call it must():
func must(err error) {
if err != nil {
panic(err)
}
}
Then, in order to use a root FS, we can add the following few lines of code right at the start of the child() function:
must(syscall.Mount("rootfs", "rootfs", "", syscall.MS_BIND, ""))
must(os.MkdirAll("rootfs/oldrootfs", 0700))
must(syscall.PivotRoot("rootfs", "rootfs/oldrootfs"))
must(os.Chdir("/"))
In these lines of code, what’s happening is explained below, line by line:
- In the first line, we perform a bind mount, which is a way of remapping an existing part of the file system hierarchy. In this case, it remaps the “rootfs” directory onto itself. This is a common step in container setup to ensure that modifications to the filesystem are confined to a specific directory (“rootfs” in this case).
- Then, we create a new directory named “oldrootfs” inside the “rootfs” directory using the MkdirAll function, which ensures that the directory is created along with any necessary parent directories; the 0700 permission setting ensures that only the owner of the directory has read, write, and execute permissions.
- Next, we change the root filesystem of the calling process to “rootfs” and move the old root filesystem to “rootfs/oldrootfs”. PivotRoot is crucial in containers for isolating the filesystem: it effectively makes “rootfs” the new root of the filesystem for the process, and the old root is now accessible under “rootfs/oldrootfs” (see the note after this list). This step is key in ensuring that the process running inside the container sees a filesystem that is isolated from the host.
- Finally, we change the current working directory of the process to the new root (”/”). This is important as the previous operations have changed the filesystem layout, and ensuring that the current directory is set correctly is crucial for the process to operate normally within its new environment.
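Note that these lines assume a directory called rootfs exists in the working directory and contains a Linux filesystem tree (for example, one extracted from a minimal distribution image); otherwise the commands you later try to run inside the container simply won’t exist. Also, after the pivot the old root is still reachable at /oldrootfs inside the container. A slightly stricter variant, which we will not include in the final code below, could detach it afterwards:

// Optional hardening (not in the final code): lazily unmount the old root
// so the host filesystem is no longer reachable from inside the container.
must(syscall.Unmount("/oldrootfs", syscall.MNT_DETACH))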
Final code
And there we have it, the final version of the code with everything glued together is presented below:
package main
import (
"fmt"
"os"
"os/exec"
"syscall"
)
func main() {
switch os.Args[1] {
case "run":
parent()
case "child":
child()
default:
panic("IDK what to do!")
}
}
func parent() {
// Re-execute ourselves with the "child" argument, inside new namespaces.
cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
// New UTS, PID, mount and user namespaces for the child process.
cmd.SysProcAttr = &syscall.SysProcAttr{
Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS | syscall.CLONE_NEWUSER,
}
cmd.Stdin = os.Stdin
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
fmt.Println("[ PARENT ERROR ]:", err)
os.Exit(1)
}
}
func child() {
// Root FS operations: bind-mount rootfs onto itself, pivot into it and
// keep the old root reachable under /oldrootfs.
must(syscall.Mount("rootfs", "rootfs", "", syscall.MS_BIND, ""))
must(os.MkdirAll("rootfs/oldrootfs", 0700))
must(syscall.PivotRoot("rootfs", "rootfs/oldrootfs"))
must(os.Chdir("/"))
// Run the requested command inside the new namespaces and root filesystem.
cmd := exec.Command(os.Args[2], os.Args[3:]...)
cmd.Stdin = os.Stdin
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
fmt.Println("[ CHILD ERROR ]:", err)
os.Exit(1)
}
}
func must(err error) {
if err != nil {
panic(err)
}
}
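To try it out, build the program and, assuming a populated rootfs directory sits next to the binary, run something like sudo ./main run /bin/sh (the binary name depends on how you build it). Depending on your kernel and user namespace configuration you may not even need sudo, and the shell you get should only see the files under rootfs.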
And that is pretty much all for this time! We went through the theoretical background needed to fully grasp the concept of containers in Linux environments, as well as a very simple yet very useful practical example which lets you create your own container in less than 60 lines of code. I hope you enjoyed this post and learned something, and be sure to follow me on X/Twitter so you get to know when the next post will be uploaded, or just to share your thoughts about this one.
Sources
- man-pages
- Docker
- Linux Containers
- Special thanks to Julian Friedman for this great article from which I got the idea (and the code!) for this post.