Podman, part of the libpod library, enables users to manage pods, containers, and container images. In my last article, I wrote about Podman as a more secure way to run containers. Here, I'll explain how to use Podman to run containers in separate user namespaces.
I have always thought of user namespace, primarily developed by Red Hat's Eric Biederman, as a great feature for separating containers. User namespace allows you to specify a user identifier (UID) and group identifier (GID) mapping to run your containers. This means you can run as UID 0 inside the container and UID 100000 outside the container. If your container processes escape the container, the kernel will treat them as UID 100000. Not only that, but any file object owned by a UID that isn't mapped into the user namespace will be treated as owned by "nobody" (65534, kernel.overflowuid), and the container process will not be allowed access unless the object is accessible by "other" (world readable/writable).
If you have a file owned by "real" root with permissions 660, and the container processes in the user namespace attempt to read it, they will be prevented from accessing it and will see the file as owned by nobody.
An example
Here's how that might work. First, I create a file in my system owned by root.
$ sudo sh -c 'echo "Test" > /tmp/test'
$ sudo chmod 660 /tmp/test
$ sudo ls -l /tmp/test
-rw-rw----. 1 root root 8 Nov 30 07:40 /tmp/test
Next, I volume-mount the file into a container running with a user namespace map 0:100000:5000.
$ sudo podman run -ti -v /tmp/test:/tmp/test:Z --uidmap 0:100000:5000 fedora sh
# id
uid=0(root) gid=0(root) groups=0(root)
# ls -l /tmp/test
-rw-rw----. 1 nobody nobody 8 Nov 30 12:40 /tmp/test
# cat /tmp/test
cat: /tmp/test: Permission denied
The --uidmap setting above tells Podman to map a range of 5000 UIDs: the range starting at UID 0 inside the container (0-4999) corresponds to the range starting at UID 100000 outside the container (100000-104999). If my process is running as UID 1 inside the container, it is UID 100001 on the host.
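The arithmetic behind a single --uidmap triple can be sketched as a small shell function. This is only an illustration of the mapping described above, not how Podman itself implements it; the function name to_host_uid is made up for this example.

```shell
# Translate a container UID to its host UID for one --uidmap triple.
# Usage: to_host_uid CONTAINER_START:HOST_START:LENGTH CONTAINER_UID
# Prints the host UID, or returns 1 if the UID is outside the mapped range.
# (to_host_uid is a hypothetical helper, not a Podman command.)
to_host_uid() {
    map=$1
    uid=$2
    # Split "container_start:host_start:length" on the colons.
    c_start=${map%%:*}
    rest=${map#*:}
    h_start=${rest%%:*}
    length=${rest#*:}
    if [ "$uid" -lt "$c_start" ] || [ "$uid" -ge $((c_start + length)) ]; then
        return 1
    fi
    echo $((h_start + uid - c_start))
}

to_host_uid 0:100000:5000 0      # prints 100000 (container root)
to_host_uid 0:100000:5000 1      # prints 100001
to_host_uid 0:100000:5000 4999   # prints 104999 (last mapped UID)
```

A container UID of 5000 or higher falls outside this map, so the function prints nothing and returns a non-zero status.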
Since the real UID=0 is not mapped into the container, any file owned by root will be treated as owned by nobody. Even if the process inside the container has CAP_DAC_OVERRIDE, it can't override this protection. DAC_OVERRIDE enables root processes to read/write any file on the system, even if the file is not owned by root and is not world readable or writable.
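To see why the file shows up as owned by nobody, it can help to run the mapping in the other direction, from a host UID to what the container sees. The sketch below assumes the default overflow UID of 65534 mentioned earlier (on a real system the value comes from /proc/sys/kernel/overflowuid); the name seen_in_container is invented for this example.

```shell
OVERFLOW_UID=65534   # assumed default for kernel.overflowuid

# Usage: seen_in_container CONTAINER_START HOST_START LENGTH HOST_UID
# Prints the UID a process inside the user namespace sees for a file
# owned by HOST_UID. Unmapped owners appear as the overflow UID ("nobody").
# (seen_in_container is a hypothetical helper for illustration only.)
seen_in_container() {
    c_start=$1
    h_start=$2
    length=$3
    host_uid=$4
    if [ "$host_uid" -ge "$h_start" ] && [ "$host_uid" -lt $((h_start + length)) ]; then
        echo $((c_start + host_uid - h_start))
    else
        echo "$OVERFLOW_UID"
    fi
}

seen_in_container 0 100000 5000 0        # prints 65534: real root is unmapped
seen_in_container 0 100000 5000 100000   # prints 0: the container's own root
```

With the map 0:100000:5000, host UID 0 lies outside 100000-104999, which is why /tmp/test appears as owned by nobody in the listing above.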
User namespace capabilities are not the same as capabilities on the host. They are namespaced capabilities. This means my container root has capabilities only within the container, and really only across the range of UIDs that were mapped into the user namespace. If a container process escaped the container, it wouldn't have any capabilities over UIDs not mapped into the user namespace, including UID=0. Even if the processes could somehow enter another container, they would not have those capabilities if the container uses a different range of UIDs.
Note that SELinux and other technologies also limit what would happen if a container process broke out of the container.
Using `podman top` to show user namespaces
We have added features to podman top to allow you to examine the usernames of processes running inside a container and identify their real UIDs on the host.
Let's start by running a sleep container using our UID mapping.
$ sudo podman run --uidmap 0:100000:5000 -d fedora sleep 1000
Now run podman top:
$ sudo podman top --latest user huser
USER HUSER
root 100000
$ ps -ef | grep sleep
100000 21821 21809 0 08:04 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000
Notice podman top reports that the user process is running as root inside the container but as UID 100000 on the host (HUSER). Also the ps command confirms that the sleep process is running as UID 100000.
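Another way to cross-check the HUSER column is the kernel's own view of the mapping, which Linux exposes in /proc/<pid>/uid_map as lines of three numbers: the first UID inside the namespace, the first UID outside it, and the length of the range. The helper below is a rough sketch that reads such lines from standard input; host_uid_of is a made-up name, and the sample line is what I would expect for --uidmap 0:100000:5000 rather than captured output.

```shell
# Usage: host_uid_of CONTAINER_UID < uid_map_file
# Reads uid_map-style lines ("inside outside count") from stdin and prints
# the host UID for CONTAINER_UID; returns 1 if no line covers it.
# (host_uid_of is a hypothetical helper, not part of Podman.)
host_uid_of() {
    want=$1
    while read -r inside outside count; do
        if [ "$want" -ge "$inside" ] && [ "$want" -lt $((inside + count)) ]; then
            echo $((outside + want - inside))
            return 0
        fi
    done
    return 1
}

# Sample uid_map contents, assumed to match --uidmap 0:100000:5000.
printf '%10s %10s %10s\n' 0 100000 5000 | host_uid_of 0   # prints 100000

# On a real host you could feed it the map of the sleep process, e.g.:
#   sudo cat /proc/21821/uid_map | host_uid_of 0
```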
Now let's run a second container, but this time we will choose a separate UID map starting at 200000.
$ sudo podman run --uidmap 0:200000:5000 -d fedora sleep 1000
$ sudo podman top --latest user huser
USER HUSER
root 200000
$ ps -ef | grep sleep
100000 21821 21809 0 08:04 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000
200000 23644 23632 1 08:08 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 1000
Notice that podman top reports the second container is running as root inside the container but as UID=200000 on the host.
Also look at the ps command: it shows both sleep processes running, one as 100000 and the other as 200000.
This means running the containers inside separate user namespaces gives you traditional UID separation between processes, which has been the standard security tool of Linux/Unix from the beginning.
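That separation only holds if the host UID ranges you hand out do not overlap. A quick sanity check can be sketched as follows; ranges_overlap is a name I've made up for this example, and it only compares two ranges given as a start and a length.

```shell
# Usage: ranges_overlap START_A LEN_A START_B LEN_B
# Returns 0 (true) if the two host UID ranges share at least one UID.
# (ranges_overlap is a hypothetical helper for illustration only.)
ranges_overlap() {
    [ "$1" -lt $(($3 + $4)) ] && [ "$3" -lt $(($1 + $2)) ]
}

# The two containers above use 100000-104999 and 200000-204999.
if ranges_overlap 100000 5000 200000 5000; then
    echo "overlap"
else
    echo "disjoint"    # this branch runs
fi
```

If the second container had been started with a map such as 0:104000:5000, the check would report an overlap, and the two containers would share host UIDs 104000-104999.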
Problems with user namespaces
For several years, I've advocated user namespace as the security tool everyone wants but hardly anyone has used. The reason is that there hasn't been any filesystem support for it, such as a shifting file system.
In containers, you want to share the base image between lots of containers. The examples above all use the Fedora base image. Most of the files in the Fedora image are owned by real UID=0. If I run a container on this image with the user namespace 0:100000:5000, by default it sees all of these files as owned by nobody, so we need to shift all of these UIDs to match the user namespace. For years, I've wanted a mount option to tell the kernel to remap these file UIDs to match the user namespace. Upstream kernel storage developers continue to investigate and make progress on this feature, but it is a difficult problem.
Podman can use different user namespaces on the same image because of automatic chowning built into