02 Jul 2021
In a previous post we investigated a jail system using chroot with the conclusion that it was not a safe implementation. In this post we’ll study a safer alternative using Linux namespaces. We’ll develop a C++ application along the way.
The idea of Linux namespaces is actually very close to that of a sandbox. We want to create isolated subsystems within a system, so that if they’re tampered with, the hosting system is protected.
Linux allows sandboxing different pieces of its system. For example the user namespace consists of a separate set of users and groups.
There are at least 8 different namespaces available, but for the purposes of our simple sandbox, we’ll focus on 2: the user and mount namespaces.
We’ll develop a C++ application to create a jailed process. The idea is to define a base class that does the heavy lifting and exposes some parameters that child classes can configure.
The base class stub follows:
and a sample child class:
We’ll assume the existence of some utils functions:
The clone() function is a more general version of fork(), which allows for more granular configuration. It can be used to start a new child process. It takes a few arguments:
This function will create a child process and make it execute f with the provided arguments. The clone flags will determine what capabilities this child process will have, including what namespaces it will use. Let’s start with no flags for now.
The parent process will receive the child’s PID from clone() and continue its execution.
We’re mostly interested in f for now. Assume we have a function allocate_stack() that will allocate some memory for the stack available to the child process.
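One way allocate_stack() could look is sketched below. It is only an assumption about the helper: it uses mmap() and returns a pointer to the top of the region, since the stack grows downwards on most architectures and clone() expects the topmost address. The size parameter and the nullptr-on-failure convention are choices made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Allocates `size` bytes for the child's stack and returns a pointer to the
// TOP of the region (one past the last byte). Returns nullptr on failure.
char* allocate_stack(std::size_t size) {
    assert(size > 0);
    void* base = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED) {
        return nullptr;
    }
    return static_cast<char*>(base) + size;
}
```

A call such as allocate_stack(1024 * 1024) yields a 1 MiB stack whose top can be passed directly to clone().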
We want the child process to call a function in ShellProcess, so we define an abstract function child_function() which the child class has to implement. We also add child_function_wrapper() so our base class can execute some code when in the child process.
We can’t pass non-static methods as function pointers, so we pass this as an argument to the static function.
The child class looks like:
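A minimal sketch of how the two classes could fit together. The base class name NamespaceProcess is an assumption made for this example; child_function(), child_function_wrapper() and ShellProcess come from the text. The wrapper has the int (*)(void*) signature that clone() expects and recovers the object from the opaque pointer.

```cpp
#include <cassert>
#include <iostream>

// NamespaceProcess is a hypothetical name for the base class.
class NamespaceProcess {
public:
    virtual ~NamespaceProcess() = default;

    // Static, so it can be passed to clone() as a plain function pointer.
    // `arg` must be the NamespaceProcess* that the parent passed as `this`.
    static int child_function_wrapper(void* arg) {
        assert(arg != nullptr);
        auto* self = static_cast<NamespaceProcess*>(arg);
        // Base-class code that must run inside the child would go here.
        return self->child_function();
    }

protected:
    // Implemented by each child class.
    virtual int child_function() = 0;
};

class ShellProcess : public NamespaceProcess {
protected:
    int child_function() override {
        std::cout << "hello from the child" << std::endl;
        return 0;
    }
};
```

Inside a method of the base class, the call would then look roughly like clone(child_function_wrapper, stack_top, flags, this).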
We should be able to compile and run this, but we might not see any results because the parent process ends before the child can run. We need some synchronization.
We want the parent process to wait for the child to finish. We can wait for the SIGCHLD signal, which the child will only emit if we pass the SIGCHLD flag to clone().
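The waiting step can be sketched as follows. This is an illustration rather than the post's code: child_main() and run_and_wait() are hypothetical names, the stack is a std::vector instead of allocate_stack(), and it relies on g++ defining _GNU_SOURCE by default so that clone() is declared in <sched.h>.

```cpp
#include <cassert>
#include <csignal>
#include <sched.h>
#include <sys/wait.h>
#include <vector>

// Runs in the child; the returned value becomes the child's exit status.
static int child_main(void* arg) {
    return *static_cast<int*>(arg);
}

// Clones a child that exits with `exit_code`, waits for it and returns the
// exit status observed by the parent (or -1 on failure).
int run_and_wait(int exit_code) {
    assert(exit_code >= 0 && exit_code <= 255);

    std::vector<char> stack(1024 * 1024);
    // The stack grows downwards, so clone() gets the top of the buffer.
    char* stack_top = stack.data() + stack.size();

    // SIGCHLD is the signal delivered to the parent when the child exits.
    // Without it, a plain waitpid() would not find the child.
    pid_t pid = clone(child_main, stack_top, SIGCHLD, &exit_code);
    if (pid == -1) {
        return -1;
    }

    int status = 0;
    if (waitpid(pid, &status, 0) == -1) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_and_wait(7) should return 7 once the child has finished.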
Let’s use a better implementation for ShellProcess so we can try out commands in a jailed environment.
In the example below we start a new shell, replacing the current child process. We customize it with a new PS1 so it’s more obvious when we are inside the child process.
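A sketch of that idea, under a few assumptions: the shell is /bin/sh, the prompt string is arbitrary, and exec_shell() and shell_exit_status() are hypothetical helpers. The extra_args parameter only exists so the shell can also be run non-interactively, and fork() stands in for the cloned child to keep the example short.

```cpp
#include <cassert>
#include <string>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Replaces the calling process with /bin/sh, using `ps1` as the prompt.
// Only returns (with -1) if execve() fails.
int exec_shell(const std::string& ps1,
               const std::vector<std::string>& extra_args = {}) {
    assert(!ps1.empty());
    std::string ps1_var = "PS1=" + ps1;

    std::vector<char*> argv = {const_cast<char*>("/bin/sh")};
    for (const auto& arg : extra_args) {
        argv.push_back(const_cast<char*>(arg.c_str()));
    }
    argv.push_back(nullptr);

    // The new shell gets an environment containing only PS1.
    char* envp[] = {const_cast<char*>(ps1_var.c_str()), nullptr};
    return execve("/bin/sh", argv.data(), envp);
}

// Demo helper: runs the shell in a forked child and returns its exit status.
int shell_exit_status(const std::vector<std::string>& args) {
    pid_t pid = fork();
    if (pid == -1) {
        return -1;
    }
    if (pid == 0) {
        exec_shell("[jail] $ ", args);
        _exit(127);  // Only reached if execve() failed.
    }
    int status = 0;
    if (waitpid(pid, &status, 0) == -1) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling exec_shell("[jail] $ ") from child_function() would start an interactive shell with the custom prompt.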
We can try it out:
By default the child process has access to the same resources as the parent, including root access. We want to restrict that.
We’re ready for our first namespace, the user namespace. We can create it simply by adding the CLONE_NEWUSER flag to the flags passed to clone().
When we run:
The user metadata starts blank, and 65534 (the kernel’s default overflow ID) represents an undefined, unmapped user. Let’s fix this.
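To observe this from code, the sketch below clones a child with the given flags and has it report getuid() back through a pipe. report_uid() and uid_seen_by_child() are hypothetical names. On systems where unprivileged user namespaces are disabled, clone() fails and the function returns -1. The value 65534 is the default overflow UID and can be configured differently.

```cpp
#include <cassert>
#include <csignal>
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Child: send the user ID it sees through the write end of the pipe.
static int report_uid(void* arg) {
    int write_fd = *static_cast<int*>(arg);
    uid_t uid = getuid();
    ssize_t n = write(write_fd, &uid, sizeof(uid));
    return n == static_cast<ssize_t>(sizeof(uid)) ? 0 : 1;
}

// Returns the UID seen by a child created with `flags`, or -1 if the child
// could not be created (e.g. user namespaces are not permitted).
long uid_seen_by_child(int flags) {
    // The child must get its own copy of memory for this sketch to work.
    assert((flags & CLONE_VM) == 0);

    int pipe_fd[2];
    if (pipe(pipe_fd) == -1) {
        return -1;
    }

    std::vector<char> stack(1024 * 1024);
    pid_t pid = clone(report_uid, stack.data() + stack.size(),
                      flags | SIGCHLD, &pipe_fd[1]);
    close(pipe_fd[1]);  // The parent only reads.

    long result = -1;
    if (pid != -1) {
        uid_t uid = 0;
        if (read(pipe_fd[0], &uid, sizeof(uid)) ==
            static_cast<ssize_t>(sizeof(uid))) {
            result = static_cast<long>(uid);
        }
        waitpid(pid, nullptr, 0);
    }
    close(pipe_fd[0]);
    return result;
}
```

With no flags the child reports the parent's own UID. With CLONE_NEWUSER and no mapping written yet, it should report the overflow ID.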
We can create a mapping between IDs inside the namespace and outside. The map is stored in the file /proc/<pid>/uid_map, where <pid> is the ID of the current process.
So for example, if we have a process with PID 31378, we can inspect that file:
Each line represents one mapping. The columns are “ID_inside-ns”, “ID_outside-ns” and “length”. These three numbers represent 2 ranges of the same length: the first is [ID_inside-ns, ID_inside-ns + length - 1] and the second is [ID_outside-ns, ID_outside-ns + length - 1], and IDs in the first range map to IDs in the second range.
This is much easier to understand with an example. If we have a line with 10 1000 3, it means the IDs [10, 11, 12] in the current process map to the IDs [1000, 1001, 1002] in the parent process. Thus 0 0 4294967295 (which is the default mapping) effectively represents a 1:1 mapping between every ID.
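The range arithmetic can be written down as a small lookup. This only models how a map is interpreted and does not touch the kernel. IdMapEntry and map_to_outside() are names made up for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One line of uid_map: "ID_inside-ns ID_outside-ns length".
struct IdMapEntry {
    std::uint32_t inside;
    std::uint32_t outside;
    std::uint32_t length;
};

// Translates an ID inside the namespace to the matching ID outside.
// Returns -1 if no entry covers `id`.
long long map_to_outside(const std::vector<IdMapEntry>& map, std::uint32_t id) {
    for (const auto& entry : map) {
        assert(entry.length > 0);
        // The subtraction cannot underflow because of the first check.
        if (id >= entry.inside && id - entry.inside < entry.length) {
            return static_cast<long long>(entry.outside) + (id - entry.inside);
        }
    }
    return -1;
}
```

With the single entry {10, 1000, 3}, ID 11 translates to 1001, while ID 13 falls outside the range and is unmapped.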
We can create a simple map so that the user ID 0 in the child maps to our current user running the parent:
Then we write to the file corresponding to a given pid:
The tricky part is that the child process does not have privileges to write to its own uid_map file, so it’s the parent that has to do it. Let’s assume we have a function before_child_runs() that takes the child pid and, as the name suggests, runs before the child. This is where we set the uid map:
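A sketch of what the parent could do from before_child_runs(). The helper names uid_map_path(), root_mapping_line() and write_uid_map() are hypothetical. The mapping is the single line described earlier: ID 0 inside the namespace maps to one chosen ID outside.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Path of the uid_map file for the process with the given PID.
std::string uid_map_path(pid_t pid) {
    assert(pid > 0);
    return "/proc/" + std::to_string(pid) + "/uid_map";
}

// One mapping line: ID 0 inside the namespace maps to `outside_uid`.
std::string root_mapping_line(uid_t outside_uid) {
    return "0 " + std::to_string(outside_uid) + " 1\n";
}

// Meant to be called by the parent for the child's PID.
// Returns false if the file could not be opened or written.
bool write_uid_map(pid_t child_pid, uid_t outside_uid) {
    std::ofstream file(uid_map_path(child_pid));
    if (!file) {
        return false;
    }
    file << root_mapping_line(outside_uid);
    file.flush();
    return file.good();
}
```

The parent would call something like write_uid_map(child_pid, getuid()) before letting the child proceed.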
To guarantee the right order of execution, we’ll need more synchronization.
We’ll use pipes for this. A pipe pipe_fd contains two file descriptors: pipe_fd[0] is the read end of the pipe, and pipe_fd[1] is the write end.
When we clone a process the child inherits a copy of the open file descriptors, so pipes can be used as an IPC (inter-process communication) medium. We can also use a pipe as a synchronization mechanism, because the read() function blocks until it receives the requested amount of data or the other side closes the file descriptor.
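The blocking behaviour can be sketched as below. For brevity this uses fork() instead of clone(), and run_child_after_parent_setup() is a hypothetical name. The child blocks in read() until the parent closes the write end, which is the parent's way of saying that setup is done.

```cpp
#include <cassert>
#include <sys/wait.h>
#include <unistd.h>

// Returns the child's exit status: 0 if the child was released by the
// parent closing the pipe, non-zero otherwise (-1 on setup errors).
int run_child_after_parent_setup() {
    int pipe_fd[2];
    if (pipe(pipe_fd) == -1) {
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        close(pipe_fd[0]);
        close(pipe_fd[1]);
        return -1;
    }

    if (pid == 0) {
        // Child: close its copy of the write end, otherwise read() would
        // never see end-of-file.
        close(pipe_fd[1]);
        char byte;
        // Blocks until the parent closes the write end, then returns 0.
        ssize_t n = read(pipe_fd[0], &byte, 1);
        _exit(n == 0 ? 0 : 1);
    }

    assert(pid > 0);  // Parent from here on.
    close(pipe_fd[0]);
    // Setup that must finish before the child runs would go here,
    // e.g. writing the child's uid_map.
    close(pipe_fd[1]);  // Releases the child.

    int status = 0;
    if (waitpid(pid, &status, 0) == -1) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Nothing is ever written to the pipe here. Closing the write end is the whole signal.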
Now we can check the user is correct:
Note that we have to do the same for the group ID, which is a very similar process, but we’ll skip it for the sake of simplicity.
Let’s also create a mount namespace by adding the CLONE_NEWNS flag.
Unlike the user namespace, which starts out empty, the mount namespace starts with a copy of the host’s mount points, but we want to restrict that. We’ll define a new root for our filesystem and mount only a selected few paths on it, using pivot_root().
This is the most complicated part of the code, so let’s go over the high-level steps.
Make the current root (/) a private mount point (it’s shared by default). This article goes over the different types of mount points. From the pivot_root() man page: “These restrictions ensure that pivot_root() never propagates any changes to another mount namespace.”
Bind mount the new root onto itself so that it becomes a mount point. From the same man page: “new_root must be a path to a mount point, but can’t be “/”. A path that is not already a mount point can be converted into one by bind mounting the path onto itself.”
Mount the paths P (provided by the child class) onto the new root.
Create a directory put_old (under the new root), where the old root will be temporarily stored. From the same man page: “put_old must be at or underneath new_root.”
Call pivot_root(), which makes new_root the new root and mounts the old root onto put_old.
Remount the paths P. It seems that pivot_root() unmounts prior mounts so we have to remount. I don’t actually understand why we need to mount twice, but it only works if I do this, and this is also what nsjail does.
Most of these steps are described as an example in the man page of pivot_root().
In code it will look like:
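A rough sketch of these steps, under several assumptions. setup_new_root() and path_under() are hypothetical names, and the old root is parked in a directory called old_root. The new root and the target directories below it are expected to exist already. The sketch leaves out the remount step and read-only handling. A read-only bind mount generally takes a second mount() call with MS_REMOUNT | MS_BIND | MS_RDONLY. It must only run in the child, inside the new mount namespace.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <vector>

// Joins the new root with an absolute path:
// ("/tmp/jail", "/bin") -> "/tmp/jail/bin".
std::string path_under(const std::string& root, const std::string& path) {
    assert(!path.empty() && path[0] == '/');
    return root + path;
}

// Returns false on the first step that fails.
bool setup_new_root(const std::string& new_root,
                    const std::vector<std::string>& paths) {
    // Make the current root private so changes do not propagate.
    if (mount(nullptr, "/", nullptr, MS_REC | MS_PRIVATE, nullptr) == -1) return false;

    // Turn new_root into a mount point by bind mounting it onto itself.
    if (mount(new_root.c_str(), new_root.c_str(), nullptr, MS_BIND, nullptr) == -1) return false;

    // Bind mount each requested path below the new root.
    for (const auto& p : paths) {
        std::string target = path_under(new_root, p);
        if (mount(p.c_str(), target.c_str(), nullptr, MS_BIND | MS_REC, nullptr) == -1) return false;
    }

    // Directory where the old root will be parked.
    std::string put_old = path_under(new_root, "/old_root");
    if (mkdir(put_old.c_str(), 0700) == -1 && errno != EEXIST) return false;

    // glibc provides no pivot_root() wrapper, so go through syscall().
    if (syscall(SYS_pivot_root, new_root.c_str(), put_old.c_str()) == -1) return false;
    if (chdir("/") == -1) return false;

    // Detach the old root and remove its now empty directory.
    if (umount2("/old_root", MNT_DETACH) == -1) return false;
    return rmdir("/old_root") == 0;
}
```

In the child this might be called as setup_new_root("/tmp/jail", {"/bin", "/lib"}), where both the root directory and the list of paths are example values.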
We need to make sure this function is run before child_function(), so we can do:
Warning: Make sure new_root() is run by the child process and that the mount namespace is used! If you get permission denied and have to use sudo, you’re doing it wrong! (Speaking from experience >.<)
The child just needs to provide some paths that it would like to mount (read-only):
We should now have a minimal jailed system up and running!
The full example is available on Github.
In this post we went through all the details of creating a shell process with user and mount namespaces. Once we unmount the old root after pivot_root, the old root does not stay around (though hidden) like it does via chroot.
The process of starting with everything disabled and painfully adding capabilities is a great way to understand how things are implemented behind the scenes, for example the
Ed King’s series on Linux namespaces using Go is very instructive, where they use a higher-level API, which makes it easier to follow. The man pages from man7.org are very helpful, especially the examples!
I’d like to sandbox the network as well, but will leave it to a future post.