While trying to add network namespace support to OpenWrt’s netifd, I ran into a severe lack of documentation about what named network namespaces in Linux actually are.
It turns out that while the Linux kernel has network namespaces, naming them is really an iproute2 thing, and other tools that also try to work with named network namespaces are best off emulating the iproute2 conventions.
From the ip-netns(8) manual page:
By convention a named network namespace is an object at /var/run/netns/NAME that can be opened. The file descriptor resulting from opening /var/run/netns/NAME refers to the specified network namespace. Holding that file descriptor open keeps the network namespace alive. The file descriptor can be used with the setns(2) system call to change the network namespace associated with a task.
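To make the quoted convention concrete, here is a minimal sketch of how a program could join an existing named namespace without calling out to ip. The function name enter_named_netns is my own invention for illustration; the only things taken from the manual page are the path convention and the use of setns(2). It needs CAP_SYS_ADMIN to actually succeed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: move the calling thread into the network
 * namespace named `name`, following the /var/run/netns/NAME convention.
 * Returns 0 on success or a negative errno value. */
int enter_named_netns(const char *name)
{
	char path[PATH_MAX];
	int fd, n, ret = 0;

	n = snprintf(path, sizeof(path), "/var/run/netns/%s", name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	/* Opening the object yields an FD that refers to the namespace. */
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* CLONE_NEWNET makes setns() refuse FDs of other namespace types. */
	if (setns(fd, CLONE_NEWNET) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

Calling enter_named_netns("foo") from a privileged process would leave that process in foo until it switches again.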
It also turns out that getting these conventions right is a lot of work if you don’t want to call out to ip netns!
So, let’s go through the actual code and see what all this fuss I am making is about.
This is from iproute2 version 5.15.0.
We will assume the user said ip netns add foo.
|
|
There is a lot hiding in those words. We will figure them all out below.
|
|
I will ignore the create=false case here; that’s for ip netns attach.
|
|
The add case wants a name, and aborts if it did not get one.
The attach case also wants a PID.
|
|
In namespace.h, we find #define NETNS_RUN_DIR "/var/run/netns".
Remember, we said we were tracing ip netns add foo.
So netns_path now holds "/var/run/netns/foo".
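If you are reimplementing this, the path construction boils down to something like the following sketch. The helper name build_netns_path and its return convention are assumptions of mine, not iproute2's; only NETNS_RUN_DIR and the resulting string come from the code discussed above.

```c
#include <stdio.h>

#define NETNS_RUN_DIR "/var/run/netns"

/* Hypothetical helper: write NETNS_RUN_DIR "/" name into buf.
 * Returns 0 on success, -1 if the result does not fit. */
int build_netns_path(char *buf, size_t len, const char *name)
{
	int n = snprintf(buf, len, "%s/%s", NETNS_RUN_DIR, name);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The truncation check matters: a silently shortened path would make a later mount land on the wrong name.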
|
|
create_netns_dir() is mkdir(NETNS_RUN_DIR, S_IRWXU|S_IRGRP|S_IXGRP|S_IROTH|S_IXOTH) plus some error handling.
Those flags add up to mode 0755 (rwxr-xr-x).
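Here is a small sketch of that step for anyone emulating it. The function names are mine, and treating EEXIST as success is my assumption about what "some error handling" should mean for a directory that is shared between invocations.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* The permission bits from the mkdir() call, ORed together. */
mode_t netns_dir_mode(void)
{
	return S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH;
}

/* Hypothetical helper: create dir with those bits.
 * Assumption: an already existing directory is not an error.
 * Returns 0 on success or a negative errno value. */
int create_dir_if_missing(const char *dir)
{
	if (mkdir(dir, netns_dir_mode()) < 0 && errno != EEXIST)
		return -errno;
	return 0;
}
```

Note that the process umask still applies to mkdir(), so the directory can end up with fewer bits than 0755.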
So far, so good. Now, buckle up!
|
|
Funny word, “likely”. This code puts in a lot of effort to make something only “likely”!
|
|
You should go read Debian bug 949235, I’ll wait.
|
|
Pretty standard cooperative locking code.
One of the many parts that any piece of software that wants to work with ip’s named network namespaces needs to copy.
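For readers who have not written this kind of lock before, a cooperative lock on a directory can be sketched as below. This is an illustration under my own assumptions (an exclusive flock(2) on a directory FD, and the helper names lock_dir and unlock_dir), not a copy of the iproute2 code. It only protects against other programs that take the same lock.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: take an exclusive advisory lock on a directory.
 * Blocks until the lock is available.
 * Returns the lock FD (>= 0) or a negative errno value. */
int lock_dir(const char *dir)
{
	int fd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

	if (fd < 0)
		return -errno;

	if (flock(fd, LOCK_EX) < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	return fd;
}

/* Closing the FD releases the lock. */
void unlock_dir(int fd)
{
	close(fd);
}
```

A caller would wrap the whole mount setup between lock_dir(NETNS_RUN_DIR) and unlock_dir(fd).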
|
|
Let’s unpack this one.
A while-loop around mount?
Weird, let’s hope there is an escape later!
So, mount.
I opened this bit of code hoping to learn how to netlink my way in and out of namespaces.
Now we are learning about mounts.
Aren’t computers fun?
The prototype for mount(2) is:
int mount(const char *source, const char *target,
const char *filesystemtype, unsigned long mountflags,
const void *data);
So, we have: source "", target /var/run/netns, fstype none, two flags ORed together, and no data.
What are we mounting?
Nothing?
It turns out this call only tries to change the parameters of an existing mount.
Specifically, these two flags are set:
- MS_SHARED: Make this mount point shared. Mount and unmount events immediately under this mount point will propagate to the other mount points that are members of this mount’s peer group. Propagation here means that the same mount or unmount will automatically occur under all of the other mount points in the peer group. Conversely, mount and unmount events that take place under peer mount points will propagate to this mount point.
- MS_REC: Used in conjunction with MS_BIND to create a recursive bind mount, and in conjunction with the propagation type flags to recursively change the propagation type of all of the mounts in a subtree. See below for further details.
So, this call attempts to implement the first large comment, about making it possible for “network namespace mounts to propagate between mount namespaces”.
If the code inside this while-loop has run at least once since system startup (or since somebody destroyed /var/run/netns, of course), this call succeeds and we continue after the loop.
But, let’s see what happens the first time.
|
|
Ah, the second round will always end - either in success or failure. Good.
If /var/run/netns was not completely set up before (because this code runs for the first time), the mount call indeed fails with EINVAL.
This is accepted only if made_netns_run_dir_mount is false, which right now it is.
|
|
Another semi-magical mount invocation.
What is MS_BIND?
From the mount(2) manual page:
If mountflags includes MS_BIND (available since Linux 2.4), then perform a bind mount. A bind mount makes a file or a directory subtree visible at another point within the single directory hierarchy. Bind mounts may cross filesystem boundaries and span chroot(2) jails.
But, source and target are the same.
What does it mean?
The explanation is in the pivot_root(2) man page, not in the mount man page!
A path that is not already a mount point can be converted into one by bind mounting the path onto itself.
So, this call converts /var/run/netns into a mount point, so that, through MS_REC, this directory and any mounts inside it (we will get to those soon) can be shared between mount namespaces.
|
|
End of danger zone.
|
|
Remember, netns_path is /var/run/netns/foo.
This simply creates a plain empty file.
|
|
The file does not need any special contents, so we immediately close it again.
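A sketch of this create-and-close step, assuming the file is created exclusively so that an existing name is reported as an error. The helper name and the mode of 0 are my assumptions; the file is only ever used as a mount target, so nobody needs permission to read or write it.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create an empty file to mount a namespace on.
 * Returns 0 on success, -EEXIST if the name is already taken, or
 * another negative errno value. */
int create_mount_target(const char *path)
{
	/* O_EXCL: fail rather than reuse a file somebody else created. */
	int fd = open(path, O_RDONLY | O_CREAT | O_EXCL, 0);

	if (fd < 0)
		return -errno;

	close(fd);
	return 0;
}
```

The exclusive create doubles as the "does this name already exist?" check.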
|
|
This is where the real magic begins.
It turns out, there is no system call for just creating a new namespace and getting a handle to it.
(A handle for a namespace is an FD.
If the last FD holding onto a namespace is closed, the namespace disappears.)
The only way to create a new namespace is to have a process enter it, typically by calling unshare.
This disconnects the calling process from the namespace it was in, and puts it in a fresh one.
The CLONE_NEWNET argument tells the kernel that we only want this to happen for the network namespace.
But! ip never wanted to live in another namespace.
It only wanted to make one!
netns_save() looks like this:
saved_netns = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
The FD returned from open becomes a handle to the network namespace ip was started from.
In many cases, this will be the initial namespace, that does not have a name.
(Remember, the kernel does not know about names for namespaces at all!)
So, we get a handle to our original namespace, we disconnect from it (unshare), and now the ip process is running in an entirely new and fresh network namespace.
If ip exited now, that new and fresh network namespace would immediately disappear!
So, we are not done yet.
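The save-then-unshare dance, plus the way back, can be sketched like this. The function names are hypothetical, and the restore step via setns(2) is my assumption about how the saved handle gets used later. Both unshare and setns need CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical helper: remember the current network namespace, then
 * move the caller into a fresh one.
 * Returns an FD for the old namespace or a negative errno value. */
int enter_new_netns(void)
{
	int saved = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);

	if (saved < 0)
		return -errno;

	if (unshare(CLONE_NEWNET) < 0) {
		int err = errno;

		close(saved);
		return -err;
	}
	return saved;
}

/* Hypothetical helper: go back to the saved namespace and drop the
 * handle. Returns 0 on success or a negative errno value. */
int restore_netns(int saved)
{
	int ret = 0;

	if (setns(saved, CLONE_NEWNET) < 0)
		ret = -errno;

	close(saved);
	return ret;
}
```

Anything that should keep the new namespace alive has to happen between the two calls.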
So, we are not done yet.
|
|
That path looks familiar.
We used it a few lines ago to hold onto our original namespace.
But now, after unshare, it points to a new namespace.
You can see this in /proc:
# ls -al /proc/self/ns/net
lrwxrwxrwx 1 root root 0 Dec 5 18:40 /proc/self/ns/net -> net:[4026532206]
# ip netns exec foo ls -al /proc/self/ns/net
lrwxrwxrwx 1 root root 0 Dec 5 18:40 /proc/self/ns/net -> net:[4026532351]
This output was taken after the full ip netns add foo invocation.
4026532206 is the namespace we are currently holding on to in saved_netns.
4026532351 is our fresh namespace that we have not given a name yet.
And if we exited now, it would go away.
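The number inside net:[...] is the inode number of the namespace object, so a program can tell two namespaces apart with a plain stat(2). A minimal sketch, with a helper name of my own choosing:

```c
#include <sys/stat.h>

/* Hypothetical helper: return the inode number behind a namespace
 * path such as /proc/self/ns/net, or 0 if it cannot be examined. */
unsigned long netns_inode(const char *path)
{
	struct stat st;

	/* stat() follows the link in /proc to the namespace object. */
	if (stat(path, &st) < 0)
		return 0;

	return (unsigned long)st.st_ino;
}
```

Comparing netns_inode("/proc/self/ns/net") before and after unshare would show the switch from one namespace to the other.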
It turns out that giving it a name solves that problem too.
|
|
This is for the attach case, where we use ip netns to give a name to a namespace that already exists because at least one process is in it.
|
|
Another magical mount invocation.
With arguments filled in:
mount("/proc/self/ns/net", "/var/run/netns/foo", "none", MS_BIND, NULL)
It’s another bind mount.
We take the handle to our new namespace in /proc, and mount it to the empty file we created earlier.
Now if ip exits, the namespace does not disappear, because this mount is holding on to it for us!
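Wrapped up as a function, the naming step is tiny. The helper name is hypothetical, and the sketch assumes the target file already exists and that the caller is currently inside the namespace to be named. It needs CAP_SYS_ADMIN.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical helper: pin the caller's current network namespace to
 * an existing file, which gives the namespace its name.
 * Returns 0 on success or a negative errno value. */
int name_current_netns(const char *target)
{
	if (mount("/proc/self/ns/net", target, "none", MS_BIND, NULL) < 0)
		return -errno;

	return 0;
}
```

After name_current_netns("/var/run/netns/foo") succeeds, the process can switch back to its original namespace and exit without losing foo.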
|
|
And there we have it. One fresh namespace, freshly named. You can see it for yourself:
# mount
tmpfs on /tmp/run/netns type tmpfs (rw,nosuid,nodev,noatime)
nsfs on /tmp/run/netns/foo type nsfs (rw)
Further reading (some of which has been very helpful to me while writing this post):
- https://github.com/shuveb/containers-the-hard-way (which links to https://unixism.net/2020/06/containers-the-hard-way-gocker-a-mini-docker-written-in-go/)
- https://pkg.go.dev/github.com/vishvananda/netns#section-readme - note how it saves, makes, restores, just like ip does.
- https://helda.helsinki.fi/bitstream/handle/10138/320475/Viding_Jasu_DemystifyingContainerNetworking_2020.pdf?sequence=2&isAllowed=y
Update, December 11th 2021: after I posted this article, Dr. Jens Harbott sent me:
really nice writeup, I have a bonus question though: can you also explain what the “(id: 0)” etc. means for some netns?
I had not spotted any such thing during my tinkering, but together we figured it out:
# ip netns add foo
# ip netns set foo 15
# ip netns list
foo (id: 15)
So, it turns out Linux does somewhat have a concept of naming network namespaces - by number.
Autoassignment (starting from 0) is possible too.
(Reading and assigning these IDs is the only code in ipnetns.c that uses Netlink!)