Namespace Jailing
02 Jul 2021
In a previous post we investigated a jail system using chroot with the conclusion that it was not a safe implementation. In this post we’ll study a safer alternative using Linux namespaces. We’ll develop a C++ application along the way.
The idea of Linux namespaces [1] is actually very close to that of a sandbox. We want to create subsystems within a system which are isolated, so if they’re tampered with, the hosting system is protected.
Linux allows sandboxing different pieces of its system. For example, the user namespace consists of a separate set of users and groups.
There are at least 8 different namespaces available, but for the purposes of our simple sandbox, we’ll focus on 2: the user and mount namespaces.
We’ll develop a C++ application to create a jailed process. The idea is to define a base class that does the heavy lifting and exposes some parameters that child classes can configure.
The base class stub follows:
and a sample child class:
We’ll assume the existence of some utils functions:
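As a rough sketch of this design (not the exact code from the repository), the skeleton could look like the following. The names JailedProcess, run() and check() are hypothetical, and run() is only a placeholder that calls the hook in the current process rather than creating a child.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical utility: turn a failed system call (-1) into an exception
// that carries the errno description.
inline void check(int result, const std::string& what) {
  if (result == -1) {
    throw std::runtime_error(what + ": " + std::strerror(errno));
  }
}

// Hypothetical base class: owns process creation, exposes hooks to children.
class JailedProcess {
 public:
  virtual ~JailedProcess() = default;

  // Placeholder: runs the hook in the current process. A real version
  // would hand it to clone() instead.
  int run() { return child_function(); }

 protected:
  // What the jailed process should do; the return value is its exit status.
  virtual int child_function() = 0;
};

// Sample child class.
class ShellProcess : public JailedProcess {
 protected:
  int child_function() override { return 0; }
};
```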
The clone() function is a general version of fork() [2], which allows for more granular configuration. It can be used to start a new child process. It takes a few arguments:
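- a pointer to the function f that the child will run
- a pointer to the stack the child will use
- the clone flags
- the argument passed to f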
This function will create a child process and make it execute f with the provided arguments. The clone flags will determine what capabilities this child process will have, including what namespaces it will use. Let’s start with no flags for now.
The parent process will receive the child pid from clone() and continue its execution.
We’re mostly interested in f for now. Assume we have a function allocate_stack() that will allocate some memory for the stack available to the child process.
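To make this concrete, here is a minimal sketch of calling clone() with no flags. Implementing allocate_stack() with mmap() is just one possible choice, and spawn(), the body of f and kStackSize are hypothetical names and values used only for illustration.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes clone(); g++ usually defines it already
#endif
#include <sched.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <cassert>
#include <cstddef>

constexpr std::size_t kStackSize = 1024 * 1024;  // 1 MiB, an arbitrary choice

// Returns a pointer to the TOP of a fresh memory region, because the stack
// grows downwards on the common architectures.
char* allocate_stack() {
  void* base = mmap(nullptr, kStackSize, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  assert(base != MAP_FAILED);
  return static_cast<char*>(base) + kStackSize;
}

// The function f executed by the child. Its return value becomes the
// child's exit status.
int f(void* arg) {
  return *static_cast<int*>(arg);
}

// Starts the child with no clone flags and returns its pid (or -1).
pid_t spawn(int* arg) {
  const int flags = 0;
  return clone(f, allocate_stack(), flags, arg);
}
```

The parent gets the pid back right away, and nothing here waits for the child to finish.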
We want the child process to call a function in ShellProcess, so we define an abstract function child_function() which the child class has to implement. We also add child_function_wrapper() so our base class can execute some code when in the child process.
We can’t pass non-static methods as function pointers, so we pass this as argument to the static function child_function_with_this().
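The arrangement described above could be sketched as follows. The base-class name JailedProcess and the value returned by the child are assumptions; only child_function(), child_function_wrapper() and child_function_with_this() come from the text.

```cpp
#include <cassert>

class JailedProcess {  // hypothetical base-class name
 public:
  virtual ~JailedProcess() = default;

  // Static, so it has the int(*)(void*) shape that clone() expects.
  // `context` must be the JailedProcess* that was passed as the argument.
  static int child_function_with_this(void* context) {
    assert(context != nullptr);
    return static_cast<JailedProcess*>(context)->child_function_wrapper();
  }

 protected:
  // Implemented by the child class.
  virtual int child_function() = 0;

 private:
  int child_function_wrapper() {
    // Base-class setup that must run inside the child process goes here.
    return child_function();
  }
};

class ShellProcess : public JailedProcess {
 protected:
  int child_function() override { return 42; }  // placeholder body
};
```

Inside the base class the call would then be along the lines of clone(child_function_with_this, stack, flags, this).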
The child class looks like:
We should be able to compile and run this, but we might not see any results because the parent process ends before the child can run. We need some synchronization.
We want the parent process to wait for the child to finish. We can wait for the SIGCHLD signal, which the child only sends on termination if we pass the SIGCHLD flag to clone() [2]:
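A minimal sketch of this, assuming we wait with waitpid() (the text does not say which wait call is used). child_main() and run_and_wait() are hypothetical names.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes clone(); g++ usually defines it already
#endif
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <cassert>
#include <vector>

// Stand-in for the real child function.
int child_main(void*) { return 3; }

// Clones a child that signals SIGCHLD when it terminates, blocks until it
// is done and returns its exit status (or -1 on any failure).
int run_and_wait() {
  static std::vector<char> stack(1024 * 1024);
  char* stack_top = stack.data() + stack.size();
  pid_t pid = clone(child_main, stack_top, SIGCHLD, nullptr);
  if (pid == -1) return -1;

  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```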
Let’s use a better implementation for ShellProcess so we can try out commands in a jailed environment.
In the example below we start a new shell, replacing the current child process. We customize it with a new PS1 so it’s more obvious when we are inside the child process.
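One way to do this is with execve(), sketched below. The helper name exec_shell() and the prompt text are assumptions for illustration. Note that a shell may overwrite PS1 from its startup files.

```cpp
#include <unistd.h>
#include <cassert>

// Replaces the current process image with the given shell, started with an
// environment that only contains a custom prompt.
// Only returns (with -1) if execve() fails.
int exec_shell(const char* shell_path) {
  char* const argv[] = {const_cast<char*>(shell_path), nullptr};
  char* const envp[] = {const_cast<char*>("PS1=jailed$ "), nullptr};
  return execve(shell_path, argv, envp);
}
```

In child_function() this would be used as return exec_shell("/bin/bash"); or with whichever shell is available.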
We can try it out:
By default the child process has access to the same resources as the parent, including root access. We want to restrict that.
We’re ready for our first namespace, the user namespace. We can create it by adding the CLONE_NEWUSER flag to the flags passed to clone().
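As a small sketch of the flag in use, the child below checks the user ID it sees right after being created. The function names are hypothetical, and the code assumes the system allows unprivileged user namespaces and keeps the default ID for unmapped users (65534); otherwise it reports -1 or 1.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes clone() and the CLONE_* flags
#endif
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <vector>

// Runs in the child: exit status 0 if we see the ID used for unmapped users.
int report_uid(void*) {
  return getuid() == 65534 ? 0 : 1;
}

// Returns the child's exit status, or -1 if the child could not be created
// (for example when user namespaces are disabled) or did not exit normally.
int run_in_user_namespace() {
  static std::vector<char> stack(1024 * 1024);
  const int flags = CLONE_NEWUSER | SIGCHLD;
  pid_t pid = clone(report_uid, stack.data() + stack.size(), flags, nullptr);
  if (pid == -1) return -1;

  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```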
When we run:
The user metadata starts blank, and 65534 is the placeholder ID reported for an undefined (unmapped) user. Let’s fix this.
We can create a mapping between IDs inside the namespace and outside [3]. The map is stored in the file /proc/<pid>/uid_map, where <pid> is the ID of the current process [4].
So for example, if we have a process with PID 31378, we can inspect that file:
Each line represents one mapping. The columns are “ID-inside-ns”, “ID-outside-ns” and “length” [4]. These three numbers represent 2 ranges of the same length: the first is [ID-inside-ns, ID-inside-ns + length - 1] and the second is [ID-outside-ns, ID-outside-ns + length - 1], and ids in the first range map to ids in the second range.
This is much easier to understand with an example. If we have a line with 10 1000 3, it means the range of ids [10, 11, 12] in the current process maps to [1000, 1001, 1002] in the parent process. Thus 0 0 4294967295 (which is the default mapping) effectively represents a 1:1 mapping between every id.
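The arithmetic behind one mapping line can be written down as a tiny function. This is only an illustration of the rule above, not kernel code, and the names IdMapping and map_to_outside() are made up for it.

```cpp
#include <cassert>

// One line of uid_map: "inside_start outside_start length".
struct IdMapping {
  unsigned long inside_start;
  unsigned long outside_start;
  unsigned long length;
};

// Translates an id seen inside the namespace into the id outside of it.
// Returns -1 if the id is not covered by the mapping.
long long map_to_outside(const IdMapping& m, unsigned long inside_id) {
  if (inside_id < m.inside_start) return -1;
  const unsigned long offset = inside_id - m.inside_start;
  if (offset >= m.length) return -1;
  return static_cast<long long>(m.outside_start) + offset;
}
```

With the line 10 1000 3, the id 11 translates to 1001 while 13 is unmapped. With 0 0 4294967295 every covered id translates to itself.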
We can create a simple map so that the user ID 0 in the child maps to our current user running the parent:
Then we write to the file corresponding to a given pid:
The tricky part is that the child process does not have privileges to write to its own uid_map file, so it’s the parent that has to do it. Let’s assume we have a function before_child_runs() that takes the child pid and as the name suggests runs before the child. This is where we set the uid map:
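A sketch of what this hook might do, assuming a single-line map that sends user ID 0 in the child to the user running the parent. The helpers uid_map_path() and write_id_map() are hypothetical, and before_child_runs() is written as a free function for brevity.

```cpp
#include <sys/types.h>
#include <unistd.h>
#include <cassert>
#include <fstream>
#include <string>

// Path of the uid map for a given process.
std::string uid_map_path(pid_t pid) {
  return "/proc/" + std::to_string(pid) + "/uid_map";
}

// Writes one "inside outside length" line to `path`.
// Returns true if the write succeeded.
bool write_id_map(const std::string& path, unsigned inside, unsigned outside,
                  unsigned length) {
  std::ofstream out(path);
  out << inside << ' ' << outside << ' ' << length << '\n';
  out.flush();
  return out.good();
}

// Run by the parent: map uid 0 in the child to our own uid.
bool before_child_runs(pid_t child_pid) {
  return write_id_map(uid_map_path(child_pid), 0, getuid(), 1);
}
```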
To guarantee the right order of execution, we’ll need more synchronization.
We’ll use pipes for this as in [4]. A pipe pipe_fd contains two file descriptors: pipe_fd[0] is the read end of the pipe, and pipe_fd[1] is the write end.
When we clone a process the child inherits a copy of the open file descriptors, so pipes can be used as an IPC (inter-process communication) medium. We can also use a pipe as a synchronization mechanism, because the read() function blocks until data is available or every copy of the write end has been closed.
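Here is a small sketch of the pipe used as a barrier. It uses fork() instead of clone() to stay short (both give the child a copy of the descriptors), and run_with_barrier() is a hypothetical name.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>

// The child blocks on read() until the parent closes the write end.
// Returns the child's exit status: 0 means read() reported end-of-file,
// i.e. the child was released by the parent. Returns -1 on failure.
int run_with_barrier() {
  int pipe_fd[2];
  if (pipe(pipe_fd) == -1) return -1;

  pid_t pid = fork();
  if (pid == -1) return -1;

  if (pid == 0) {
    // Child: close our copy of the write end, or we would never see EOF.
    close(pipe_fd[1]);
    char byte;
    ssize_t n = read(pipe_fd[0], &byte, 1);  // blocks here
    _exit(n == 0 ? 0 : 1);
  }

  // Parent: the read end is not needed on this side.
  close(pipe_fd[0]);
  // ... work that must finish before the child proceeds, such as writing
  // the uid map, would go here ...
  close(pipe_fd[1]);  // releases the child

  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```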
Now we can check the user is correct:
Note that we have to do the same for the group id, which is a very similar process but we’ll skip for the sake of simplicity.
Let’s also create a mount namespace by adding the CLONE_NEWNS flag.
Unlike the user namespace, which starts with everything empty, the mount namespace starts with a copy of the host’s mounts, and we want to restrict that. We’ll define a new root for our filesystem and mount only a selected few paths on it, using pivot_root().
This is the most complicated part of the code, so let’s go over the high-level steps.
1. Make the current root (/) a private mount point (it’s shared by default). This article goes over the different types of mount points. From [5]: “These restrictions ensure that pivot_root() never propagates any changes to another mount namespace.”
2. Turn the new root into a mount point. From [5]: “new_root must be a path to a mount point, but can’t be "/". A path that is not already a mount point can be converted into one by bind mounting the path onto itself.”
3. Mount the paths P (provided by the child class) onto the new root.
4. Create a directory put_old (under the new root), where the old root will be temporarily stored. From [5]: “put_old must be at or underneath new_root.”
5. Call pivot_root(), which makes new_root the new root and mounts the old root onto put_old.
6. Mount the paths P again. It seems that pivot_root() unmounts prior mounts so we have to remount. I don’t actually understand why we need to mount twice, but it only works if I do this, and this is also what nsjail does [7].

Most of these steps are described as an example in the man page of pivot_root() [6].
In code it will look like:
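The sketch below follows the example in the pivot_root() man page rather than the code from the repository. It leaves out mounting the paths P, and the names enter_new_root(), pivot_root_syscall() and the old_root directory are assumptions. pivot_root() is invoked through syscall() here.

```cpp
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <string>

// Thin wrapper that goes through syscall() directly.
int pivot_root_syscall(const char* new_root, const char* put_old) {
  return static_cast<int>(syscall(SYS_pivot_root, new_root, put_old));
}

// Must run in the child, inside a new mount namespace.
// Returns 0 on success, or -1 with errno set by the first failing step.
int enter_new_root(const std::string& new_root) {
  const std::string put_old = new_root + "/old_root";

  // Make / private so that nothing propagates to other namespaces.
  if (mount(nullptr, "/", nullptr, MS_REC | MS_PRIVATE, nullptr) == -1)
    return -1;
  // Bind mount new_root onto itself so that it becomes a mount point.
  if (mount(new_root.c_str(), new_root.c_str(), nullptr, MS_BIND, nullptr) == -1)
    return -1;
  // Directory that will temporarily hold the old root.
  if (mkdir(put_old.c_str(), 0777) == -1 && errno != EEXIST) return -1;
  // Swap the roots and move into the new one.
  if (pivot_root_syscall(new_root.c_str(), put_old.c_str()) == -1) return -1;
  if (chdir("/") == -1) return -1;
  // The old root is now visible at /old_root. Detach it and remove the
  // empty directory left behind.
  if (umount2("/old_root", MNT_DETACH) == -1) return -1;
  if (rmdir("/old_root") == -1) return -1;
  return 0;
}
```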
We need to make sure this function is run before child_function() so we can do:
Warning: Make sure new_root() is run by the child process and that the mount namespace is used! If you get permission denied and have to use sudo, you’re doing it wrong! (Speaking from experience >.<)
The child just needs to provide some paths that it would like to mount (read-only):
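As an illustration, the child could return a list of host paths and the base class could bind mount each one under the new root. The names and the example paths below are hypothetical. According to mount(2), flags such as MS_RDONLY are ignored on the initial bind mount, so a read-only bind mount takes a second mount() call with MS_REMOUNT.

```cpp
#include <sys/mount.h>
#include <cassert>
#include <string>
#include <vector>

// What a child class might provide: host paths to expose in the jail.
std::vector<std::string> paths_to_mount() {
  return {"/bin", "/lib", "/usr"};  // example values only
}

// Location of a host path once mounted under the new root.
std::string target_in(const std::string& new_root, const std::string& path) {
  return new_root + path;
}

// Bind mounts `source` on `target` (which must already exist) and then
// remounts it read-only. Returns 0 on success, -1 with errno set otherwise.
int bind_mount_read_only(const std::string& source, const std::string& target) {
  if (mount(source.c_str(), target.c_str(), nullptr, MS_BIND | MS_REC,
            nullptr) == -1) {
    return -1;
  }
  return mount(nullptr, target.c_str(), nullptr,
               MS_REMOUNT | MS_BIND | MS_RDONLY, nullptr);
}
```

Depending on the mount flags of the source, the remount inside a user namespace may need additional flags to be accepted.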
We should now have a minimal jailed system up and running!
The full example is available on Github.
In this post we went through all the details of creating a shell process with user and mount namespaces. Once we unmount the old root after pivot_root, the old root does not stay around (though hidden) like it does via chroot [8].
The process of starting with everything disabled and painfully adding capabilities is a great way to understand how things are implemented behind the scenes, for example the /proc/<pid>/uid_map file.
Ed King’s series [3] on Linux namespaces using Go is very instructive. It uses a higher-level API, which makes it easier to follow. The man pages from man7.org are very helpful, especially the examples!
I’d like to sandbox the network as well, but will leave it to a future post.