While trying to add network namespace support to OpenWrt’s
netifd, I ran into a severe lack of documentation about what named network namespaces in Linux actually are.
It turns out that while the Linux kernel has network namespaces, naming them is really an iproute2 thing, and other tools that also try to work with named network namespaces are best off emulating the iproute2 conventions.
ip-netns(8) manual page:
By convention a named network namespace is an object at /var/run/netns/NAME that can be opened. The file descriptor resulting from opening /var/run/netns/NAME refers to the specified network namespace. Holding that file descriptor open keeps the network namespace alive. The file descriptor can be used with the setns(2) system call to change the network namespace associated with a task.
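The convention quoted above can be sketched in a few lines of C. This is only an illustration, not code from iproute2: the function names are made up for this post, and the check that rejects empty names and names containing a slash is my own assumption about what a careful caller would want.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NETNS_RUN_DIR "/var/run/netns"

/* Build "/var/run/netns/NAME" into buf. Returns 0, or -1 if the name is
 * unusable or the result does not fit. (Hypothetical helper.) */
int netns_name_to_path(const char *name, char *buf, size_t len)
{
    assert(name != NULL && buf != NULL);
    /* Assumption: a name must be a single, non-empty path component. */
    if (name[0] == '\0' || strchr(name, '/') != NULL)
        return -1;
    int n = snprintf(buf, len, "%s/%s", NETNS_RUN_DIR, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Open the named namespace. The returned FD keeps it alive while open. */
int netns_open_by_name(const char *name)
{
    char path[PATH_MAX];
    if (netns_name_to_path(name, path, sizeof(path)) < 0)
        return -1;
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Move the calling thread into the named namespace (needs CAP_SYS_ADMIN). */
int netns_enter_by_name(const char *name)
{
    int fd = netns_open_by_name(name);
    if (fd < 0)
        return -1;
    int ret = setns(fd, CLONE_NEWNET);
    close(fd);   /* we are a member now, so the FD is no longer needed */
    return ret;
}
```

Calling `netns_enter_by_name("foo")` as root would then put the caller into the namespace that `ip netns add foo` created.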
It also turns out that getting these conventions right is a lot of work, if you don’t want to call out to
So, let’s go through the actual code and see what all this fuss I am making is about.
This is from version 5.15.0.
We will assume the user said
ip netns add foo.
There is a lot hiding in those words. We will figure them all out below.
I will ignore the
create=false case here - that’s for
ip netns attach.
The add case wants a name, and aborts if it did not get one.
The attach case also wants a PID.
In namespace.h, we find
#define NETNS_RUN_DIR "/var/run/netns".
Remember, we said we were tracing
ip netns add foo.
netns_path now holds
plus some error handling.
Those flags add up to
So far, so good. Now, buckle up!
Funny word, “likely”. This code puts in a lot of effort to make something only “likely”!
You should go read Debian bug 949235, I’ll wait.
Pretty standard cooperative locking code.
One of the many parts that any piece of software that wants to work with
ip’s named network namespaces needs to copy.
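For readers who have not seen this pattern before, here is a minimal sketch of cooperative locking on a directory with flock(2). It is an illustration under my own assumptions (hypothetical function names, a blocking exclusive lock), not a copy of what ip does.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive advisory lock on a directory.
 * Returns the lock FD, or -1 on failure. (Hypothetical helper.) */
int lock_dir(const char *dir)
{
    assert(dir != NULL);
    int fd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    /* Blocks until every other cooperating process has released its lock. */
    if (flock(fd, LOCK_EX) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Release the lock taken by lock_dir() and close its FD. */
int unlock_dir(int fd)
{
    int ret = flock(fd, LOCK_UN);
    close(fd);
    return ret;
}
```

The lock is advisory: it only protects against other programs that take the same lock on the same directory, which is exactly why every tool working with these namespaces has to play along.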
Let’s unpack this one.
A while-loop around
Weird, let’s hope there is an escape later!
I opened this bit of code hoping to learn how to netlink my way in and out of namespaces.
Now we are learning about mounts.
Aren’t computers fun?
The prototype for
int mount(const char *source, const char *target, const char *filesystemtype, unsigned long mountflags, const void *data);
So, we have: source
none, two flags ORed together, and no data.
What are we mounting?
It turns out this call only tries to change the parameters of an existing mount.
Specifically, these two flags are set (descriptions from the mount(2) manual page):
- MS_SHARED: Make this mount point shared. Mount and unmount events immediately under this mount point will propagate to the other mount points that are members of this mount’s peer group. Propagation here means that the same mount or unmount will automatically occur under all of the other mount points in the peer group. Conversely, mount and unmount events that take place under peer mount points will propagate to this mount point.
- MS_REC: Used in conjunction with MS_BIND to create a recursive bind mount, and in conjunction with the propagation type flags to recursively change the propagation type of all of the mounts in a subtree. See below for further details.
So, this call attempts to implement the first large comment, about making it possible for “network namespace mounts to propagate between mount namespaces”.
If the code inside this
while-loop has run at least once since system startup (or since somebody destroyed
/var/run/netns, of course), this call succeeds and we continue after the loop.
But, let’s see what happens the first time.
Ah, the second round will always end - either in success or failure. Good.
Since /var/run/netns was not completely set up before (because this code runs for the first time), the
mount call indeed fails with
This is accepted only if
made_netns_run_dir_mount is false, which right now it is.
mount(2) manual page:
If mountflags includes MS_BIND (available since Linux 2.4), then perform a bind mount. A bind mount makes a file or a directory subtree visible at another point within the single directory hierarchy. Bind mounts may cross filesystem boundaries and span chroot(2) jails.
source and target are the same.
What does it mean?
The explanation is in the
pivot_root(2) man page, not in the mount man page!
A path that is not already a mount point can be converted into one by bind mounting the path onto itself.
So, this call converts
/var/run/netns to be a mount point, so that, through
MS_REC, this dir and any mounts inside it (we will get to those soon) can be shared between mount namespaces.
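Put together, the retry logic described in this section can be sketched as follows. This is a simplified reconstruction for illustration: the function name and the `made_mount_point` variable are mine, the error reporting is left out, and it assumes that EINVAL is how the kernel reports "this is not a mount point yet".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Make dir a shared, recursive mount point. Needs CAP_SYS_ADMIN.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.) */
int make_shared_mount_point(const char *dir)
{
    int made_mount_point = 0;

    assert(dir != NULL);

    /* Changing the propagation type only works on an existing mount point. */
    while (mount("", dir, "none", MS_SHARED | MS_REC, NULL) < 0) {
        /* Assumption: EINVAL means "not a mount point yet". Anything else
         * is fatal, and so is EINVAL after we already did the bind mount,
         * which is what guarantees the loop ends. */
        if (errno != EINVAL || made_mount_point)
            return -1;

        /* Bind the directory onto itself to turn it into a mount point. */
        if (mount(dir, dir, "none", MS_BIND | MS_REC, NULL) < 0)
            return -1;
        made_mount_point = 1;
    }
    return 0;
}
```

The loop body runs at most once: the second attempt at the propagation change either succeeds or returns an error.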
End of danger zone.
This simply creates a plain empty file.
The file does not need any special contents, so we immediately close it again.
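A minimal sketch of this step, assuming exclusive creation so that an already existing name is reported as an error. The function name, the O_EXCL flag and the file mode of 0 are choices made for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Create an empty file that will later serve as a bind-mount target.
 * Returns 0 on success, -1 on failure; errno is EEXIST if the name is
 * already taken. (Hypothetical helper.) */
int create_mount_target(const char *path)
{
    assert(path != NULL);
    /* Assumption: mode 0 is enough, since nobody ever reads this file. */
    int fd = open(path, O_RDONLY | O_CREAT | O_EXCL | O_CLOEXEC, 0);
    if (fd < 0)
        return -1;
    close(fd);   /* the contents never matter, only the directory entry */
    return 0;
}
```

Creating the same path twice fails the second time, which is a cheap way to notice that a name is already in use.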
This is where the real magic begins.
It turns out, there is no system call for just creating a new namespace and getting a handle to it.
(A handle for a namespace is an FD.
If the last FD holding onto a namespace is closed, the namespace disappears.)
The only way to create a new namespace is to call
This disconnects the calling process from the namespace it was in, and puts it in a fresh one.
The CLONE_NEWNET argument tells the kernel that we only want this to happen for the network namespace.
ip never wanted to live in another namespace.
It only wanted to make one!
netns_save() looks like this:
saved_netns = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
The FD returned from
open becomes a handle to the network namespace
ip was started from.
In many cases, this will be the initial namespace, which does not have a name.
(Remember, the kernel does not know about names for namespaces at all!)
So, we get a handle to our original namespace, we disconnect from it (
unshare), and now the
ip process is running in an entirely new and fresh network namespace.
If ip exited now, that new and fresh network namespace would immediately disappear!
So, we are not done yet.
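To make the save, unshare, and go-back-home dance concrete, here is a small sketch that keeps the fresh namespace alive with a plain FD instead of a name. That is a deliberate simplification for illustration and not what ip does; both function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/stat.h>
#include <unistd.h>

#define SELF_NETNS "/proc/self/ns/net"

/* Inode number identifying the caller's current network namespace,
 * or 0 on error. (Hypothetical helper.) */
unsigned long current_netns_inode(void)
{
    struct stat st;
    return stat(SELF_NETNS, &st) == 0 ? (unsigned long)st.st_ino : 0;
}

/* Create a new network namespace and return an FD that keeps it alive,
 * leaving the caller in its original namespace. Returns -1 on failure.
 * Needs CAP_SYS_ADMIN. (Hypothetical helper.) */
int make_netns_fd(void)
{
    int saved, fresh;

    saved = open(SELF_NETNS, O_RDONLY | O_CLOEXEC);  /* handle to "home" */
    if (saved < 0)
        return -1;

    if (unshare(CLONE_NEWNET) < 0) {                 /* step into a fresh one */
        close(saved);
        return -1;
    }

    fresh = open(SELF_NETNS, O_RDONLY | O_CLOEXEC);  /* same path, new namespace */

    if (setns(saved, CLONE_NEWNET) < 0) {            /* and go back home */
        if (fresh >= 0)
            close(fresh);
        close(saved);
        return -1;
    }
    close(saved);
    return fresh;
}
```

As long as the returned FD stays open, the new namespace exists, even though no process lives in it. Once that FD is closed, the namespace is gone, which is the problem a name has to solve.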
That path looks familiar.
We used it a few lines ago to hold onto our original namespace.
But now, after
unshare, it points to a new namespace.
You can see this in
# ls -al /proc/self/ns/net
lrwxrwxrwx 1 root root 0 Dec 5 18:40 /proc/self/ns/net -> net:[4026532206]
root# ip netns exec foo ls -al /proc/self/ns/net
lrwxrwxrwx 1 root root 0 Dec 5 18:40 /proc/self/ns/net -> net:[4026532351]
This output was taken after the full
ip netns add foo invocation.
4026532206 is the namespace we are currently holding on to in saved_netns.
4026532351 is our fresh namespace that we have not given a name yet.
And if we exit now, it would go away.
It turns out that giving it a name solves that problem too.
This is for the attach case, where we use ip netns to give a name to a namespace that already exists because at least one process is in it.
With arguments filled in:
mount("/proc/self/ns/net", "/var/run/netns/foo", "none", MS_BIND, NULL)
It’s another bind mount.
We take the handle to our new namespace in
/proc, and mount it to the empty file we created earlier.
When ip exits, the namespace does not disappear, because this mount is holding on to it for us!
And there we have it. One fresh namespace, freshly named. You can see it for yourself:
# mount
tmpfs on /tmp/run/netns type tmpfs (rw,nosuid,nodev,noatime)
nsfs on /tmp/run/netns/foo type nsfs (rw)
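The final naming step can be sketched like this. The helper names are hypothetical, and using `/proc/PID/ns/net` for the attach case is my assumption about how a namespace belonging to another process would be addressed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/types.h>

/* Pick the /proc path to bind: our own namespace (pid 0, the add case)
 * or that of an existing process (the attach case).
 * Returns 0, or -1 if buf is too small. (Hypothetical helper.) */
int netns_proc_path(pid_t pid, char *buf, size_t len)
{
    static const char self[] = "/proc/self/ns/net";

    assert(buf != NULL);
    if (pid > 0) {
        int n = snprintf(buf, len, "/proc/%d/ns/net", (int)pid);
        return (n < 0 || (size_t)n >= len) ? -1 : 0;
    }
    if (sizeof(self) > len)
        return -1;
    memcpy(buf, self, sizeof(self));   /* includes the trailing NUL */
    return 0;
}

/* Pin the namespace behind proc_path to the empty file at netns_path.
 * Needs CAP_SYS_ADMIN. Returns 0 on success, -1 on failure. */
int netns_bind(const char *proc_path, const char *netns_path)
{
    return mount(proc_path, netns_path, "none", MS_BIND, NULL);
}
```

After a successful `netns_bind()`, the mount itself holds the namespace open, so no process or FD has to stay around.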
Further reading (some of which has been very helpful to me while writing this post):
- https://github.com/shuveb/containers-the-hard-way (which links to https://unixism.net/2020/06/containers-the-hard-way-gocker-a-mini-docker-written-in-go/)
- https://pkg.go.dev/github.com/vishvananda/netns#section-readme - note how it saves, makes, restores, just like
Update, December 11th 2021: after I posted this article, Dr. Jens Harbott sent me:
really nice writeup, I have a bonus question though: can you also explain what the “(id: 0)” etc. means for some netns?
I had not spotted any such thing during my tinkering, but together we figured it out:
# ip netns add foo
# ip netns set foo 15
# ip netns list
foo (id: 15)
So, it turns out Linux does somewhat have a concept of naming network namespaces - by number.
Autoassignment (starting from 0) is possible too.
(Reading and assigning these IDs is the only code in
ipnetns.c that uses Netlink!)