What does ip netns add actually do?

While trying to add network namespace support to OpenWrt’s netifd, I ran into a severe lack of documentation about what named network namespaces in Linux actually are. It turns out that while the Linux kernel has network namespaces, naming them is really an iproute2 thing, and other tools that also try to work with named network namespaces are best off emulating the iproute2 conventions.

From the ip-netns(8) manual page:

By convention a named network namespace is an object at /var/run/netns/NAME that can be opened. The file descriptor resulting from opening /var/run/netns/NAME refers to the specified network namespace. Holding that file descriptor open keeps the network namespace alive. The file descriptor can be used with the setns(2) system call to change the network namespace associated with a task.
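
To make that convention concrete, here is a minimal sketch of the consumer side: open the name, hand the FD to setns. The helper name enter_named_netns is hypothetical (it is not an iproute2 function), and the call needs CAP_SYS_ADMIN to succeed.

```c
#define _GNU_SOURCE /* for setns() */
#include <fcntl.h>
#include <limits.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, not part of iproute2: move the calling thread
 * into the named network namespace. Returns 0 on success, -1 on error. */
int enter_named_netns(const char *name)
{
	char path[PATH_MAX];
	int fd, n, ret;

	n = snprintf(path, sizeof(path), "/var/run/netns/%s", name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* CLONE_NEWNET makes setns() fail if fd is not a network namespace. */
	ret = setns(fd, CLONE_NEWNET);

	/* Once we have joined, membership itself keeps the namespace alive. */
	close(fd);
	return ret;
}
```

Called as enter_named_netns("foo") from a privileged process, this should leave the calling thread in the namespace that ip netns add foo created. For a name that does not exist, it simply returns -1.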

It also turns out that getting these conventions right is a lot of work, if you don’t want to call out to ip netns!

So, let’s go through the actual code and see what all this fuss I am making is about. This is netns_add() from ip/ipnetns.c in iproute2 version 5.15.0. We will assume the user said ip netns add foo.

static int netns_add(int argc, char **argv, bool create)
{
	/* This function creates a new network namespace and
	 * a new mount namespace and bind them into a well known
	 * location in the filesystem based on the name provided.
	 *

There is a lot hiding in those words. We will figure them all out below.

	 * If create is true, a new namespace will be created,
	 * otherwise an existing one will be attached to the file.
	 *

I will ignore the create=false case here - that’s for ip netns attach.

	 * The mount namespace is created so that any necessary
	 * userspace tweaks like remounting /sys, or bind mounting
	 * a new /etc/resolv.conf can be shared between users.
	 */
	char netns_path[PATH_MAX], proc_path[PATH_MAX];
	const char *name;
	pid_t pid;
	int fd;
	int lock;
	int made_netns_run_dir_mount = 0;

	if (create) {
		if (argc < 1) {
			fprintf(stderr, "No netns name specified\n");
			return -1;
		}
	} else {
		if (argc < 2) {
			fprintf(stderr, "No netns name and PID specified\n");
			return -1;
		}

		if (get_s32(&pid, argv[1], 0) || !pid) {
			fprintf(stderr, "Invalid PID: %s\n", argv[1]);
			return -1;
		}
	}
	name = argv[0];

The add case wants a name, and aborts if it did not get one. The attach case also wants a PID.

	snprintf(netns_path, sizeof(netns_path), "%s/%s", NETNS_RUN_DIR, name);

In namespace.h, we find #define NETNS_RUN_DIR "/var/run/netns". Remember, we said we were tracing ip netns add foo. So netns_path now holds "/var/run/netns/foo".
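
Note that the snprintf call above ignores its return value, so an overlong name would be silently truncated. A tool that copies this convention may want to check for that. The following is a small sketch under that assumption; build_netns_path is a hypothetical name, not something iproute2 provides.

```c
#include <stdio.h>

#define NETNS_RUN_DIR "/var/run/netns"

/* Hypothetical helper: write "<NETNS_RUN_DIR>/<name>" into buf.
 * Returns the path length, or -1 if the result did not fit. */
int build_netns_path(char *buf, size_t len, const char *name)
{
	int n = snprintf(buf, len, "%s/%s", NETNS_RUN_DIR, name);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

With a PATH_MAX sized buffer and the name foo, buf ends up holding /var/run/netns/foo, same as in the original.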

	if (create_netns_dir())
		return -1;

create_netns_dir() is

mkdir(NETNS_RUN_DIR, S_IRWXU|S_IRGRP|S_IXGRP|S_IROTH|S_IXOTH))

plus some error handling. Those flags add up to mode 0755 (rwxr-xr-x).
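
As a rough sketch of what such a helper can look like: the version below assumes that "the directory already exists" should count as success. The names ensure_dir and NETNS_DIR_MODE are hypothetical.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* rwx for the owner, r-x for group and others: octal 0755.
 * The process umask can still clear bits when the directory is created. */
#define NETNS_DIR_MODE (S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH)

/* Hypothetical helper: create path if it is missing. Returns 0 if the
 * directory was created or the name already existed, -1 otherwise. */
int ensure_dir(const char *path)
{
	if (mkdir(path, NETNS_DIR_MODE) == 0)
		return 0;
	return errno == EEXIST ? 0 : -1;
}
```

One caveat of this sketch: EEXIST is also what mkdir reports when the name exists but is not a directory, so a stricter version would follow up with a stat call.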

So far, so good. Now, buckle up!

	/* Make it possible for network namespace mounts to propagate between
	 * mount namespaces.  This makes it likely that a unmounting a network
	 * namespace file in one namespace will unmount the network namespace
	 * file in all namespaces allowing the network namespace to be freed
	 * sooner.

Funny word, “likely”. This code puts in a lot of effort to make something only “likely”!

	 * These setup steps need to happen only once, as if multiple ip processes
	 * try to attempt the same operation at the same time, the mountpoints will
	 * be recursively created multiple times, eventually causing the system
	 * to lock up. For example, this has been observed when multiple netns
	 * namespaces are created in parallel at boot. See:
	 * https://bugs.debian.org/949235

You should go read Debian bug 949235, I’ll wait.

	 * Try to take an exclusive file lock on the top level directory to ensure
	 * this cannot happen, but proceed nonetheless if it cannot happen for any
	 * reason.
	 */
	lock = open(NETNS_RUN_DIR, O_RDONLY|O_DIRECTORY, 0);
	if (lock < 0) {
		fprintf(stderr, "Cannot open netns runtime directory \"%s\": %s\n",
			NETNS_RUN_DIR, strerror(errno));
		return -1;
	}
	if (flock(lock, LOCK_EX) < 0) {
		fprintf(stderr, "Warning: could not flock netns runtime directory \"%s\": %s\n",
			NETNS_RUN_DIR, strerror(errno));
		close(lock);
		lock = -1;
	}

Pretty standard cooperative locking code. One of the many parts that any piece of software that wants to work with ip’s named network namespaces needs to copy.

	while (mount("", NETNS_RUN_DIR, "none", MS_SHARED | MS_REC, NULL)) {

Let’s unpack this one. A while-loop around mount? Weird, let’s hope there is an escape later!

So, mount. I opened this bit of code hoping to learn how to netlink my way in and out of namespaces. Now we are learning about mounts. Aren’t computers fun?

The prototype for mount(2) is:

int mount(const char *source, const char *target,
          const char *filesystemtype, unsigned long mountflags,
          const void *data);

So, we have: source "", target /var/run/netns, fstype none, two flags ORed together, and no data. What are we mounting? Nothing! When a propagation type flag such as MS_SHARED is given, mount ignores source, filesystemtype and data, and only changes the propagation type of the mount that already exists at target. Specifically, these two flags are set:

MS_SHARED: Make this mount point shared. Mount and unmount events immediately under this mount point will propagate to the other mount points that are members of this mount’s peer group. Propagation here means that the same mount or unmount will automatically occur under all of the other mount points in the peer group. Conversely, mount and unmount events that take place under peer mount points will propagate to this mount point.
MS_REC: Used in conjunction with MS_BIND to create a recursive bind mount, and in conjunction with the propagation type flags to recursively change the propagation type of all of the mounts in a subtree. See below for further details.

So, this call attempts to implement the first large comment, about making it possible for “network namespace mounts to propagate between mount namespaces”.

If the code inside this while-loop has run at least once since system startup (or since somebody destroyed /var/run/netns, of course), this call succeeds and we continue after the loop. But, let’s see what happens the first time.

		/* Fail unless we need to make the mount point */
		if (errno != EINVAL || made_netns_run_dir_mount) {

Ah, the second round will always end - either in success or failure. Good.

If /var/run/netns was not completely set up before (because this code runs for the first time), the mount call indeed fails with EINVAL. This is accepted only if made_netns_run_dir_mount is false, which right now it is.

			fprintf(stderr, "mount --make-shared %s failed: %s\n",
				NETNS_RUN_DIR, strerror(errno));
			if (lock != -1) {
				flock(lock, LOCK_UN);
				close(lock);
			}
			return -1;
		}

		/* Upgrade NETNS_RUN_DIR to a mount point */
		if (mount(NETNS_RUN_DIR, NETNS_RUN_DIR, "none", MS_BIND | MS_REC, NULL)) {

Another semi-magical mount invocation. What is MS_BIND? From the mount(2) manual page:

If mountflags includes MS_BIND (available since Linux 2.4), then perform a bind mount. A bind mount makes a file or a directory subtree visible at another point within the single directory hierarchy. Bind mounts may cross filesystem boundaries and span chroot(2) jails.

But source and target are the same. What does that mean? The explanation is in the pivot_root(2) man page, not in the mount man page!

A path that is not already a mount point can be converted into one by bind mounting the path onto itself.

So, this call converts /var/run/netns to be a mount point, so that, through MS_REC, this dir and any mounts inside it (we will get to those soon) can be shared between mount namespaces.

			fprintf(stderr, "mount --bind %s %s failed: %s\n",
				NETNS_RUN_DIR, NETNS_RUN_DIR, strerror(errno));
			if (lock != -1) {
				flock(lock, LOCK_UN);
				close(lock);
			}
			return -1;
		}
		made_netns_run_dir_mount = 1;
	}
	if (lock != -1) {
		flock(lock, LOCK_UN);
		close(lock);
	}

End of danger zone.

	/* Create the filesystem state */
	fd = open(netns_path, O_RDONLY|O_CREAT|O_EXCL, 0);

Remember, netns_path is /var/run/netns/foo. This simply creates a plain empty file; O_EXCL makes the call fail if the name is already taken.

	if (fd < 0) {
		fprintf(stderr, "Cannot create namespace file \"%s\": %s\n",
			netns_path, strerror(errno));
		return -1;
	}
	close(fd);

The file does not need any special contents, so we immediately close it again.
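
In isolation, the placeholder step looks like the sketch below. The helper name create_placeholder is hypothetical; the flags are the ones from the original.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create an empty file to serve as a mount target.
 * Returns 0 on success, -1 if the file already exists or cannot be
 * created. */
int create_placeholder(const char *path)
{
	/* O_EXCL: fail with EEXIST instead of reusing an existing name.
	 * Mode 0: nobody needs to read or write the file itself. */
	int fd = open(path, O_RDONLY | O_CREAT | O_EXCL, 0);

	if (fd < 0)
		return -1;
	close(fd);
	return 0;
}
```

Calling it twice with the same path succeeds the first time and fails the second time, which is what turns the file name into a claim on the namespace name.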

	if (create) {
		netns_save();
		if (unshare(CLONE_NEWNET) < 0) {

This is where the real magic begins. It turns out there is no system call for just creating a new namespace and getting a handle to it. (A handle for a namespace is an FD. A namespace disappears once nothing refers to it any more: no process lives in it, no FD is open on it, and no bind mount pins it.) The only way for a running process to create a new namespace for itself is to call unshare. (clone with the same flag also creates one, but for a new child process.) unshare disconnects the calling process from the namespace it was in, and puts it in a fresh one. The CLONE_NEWNET argument tells the kernel that we only want this to happen for the network namespace.

But! ip never wanted to live in another namespace. It only wanted to make one!

netns_save() looks like this:

saved_netns = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);

The FD returned from open becomes a handle to the network namespace ip was started from. In many cases, this will be the initial namespace, that does not have a name. (Remember, the kernel does not know about names for namespaces at all!)

So, we get a handle to our original namespace, we disconnect from it (unshare), and now the ip process is running in an entirely new and fresh network namespace.

If ip exited now, that new and fresh network namespace would immediately disappear! So, we are not done yet.
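
The save, unshare, restore dance can be sketched as a small wrapper. Everything here is illustrative: save_netns and run_in_new_netns are hypothetical names, the callback shape is an assumption of this sketch, and the namespace switch needs CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE /* for unshare() and setns() */
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical helper: return an fd that refers to the caller's current
 * network namespace, or -1 on error. */
int save_netns(void)
{
	return open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
}

/* Hypothetical helper: switch into a brand-new network namespace, run
 * fn(arg) there, then switch back. Returns fn's result, or -1 if a
 * namespace switch failed. */
int run_in_new_netns(int (*fn)(void *), void *arg)
{
	int ret, saved = save_netns();

	if (saved < 0)
		return -1;
	if (unshare(CLONE_NEWNET) < 0) {
		close(saved);
		return -1;
	}

	/* We are now the only member of the new namespace. */
	ret = fn(arg);

	/* Go home. Unless fn pinned the new namespace somehow, nothing
	 * refers to it after this call and the kernel frees it. */
	if (setns(saved, CLONE_NEWNET) < 0)
		ret = -1;
	close(saved);
	return ret;
}
```

In this picture, the rest of netns_add is the body of fn: whatever has to happen while the process is still inside the new namespace so that the namespace survives the trip home.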

			fprintf(stderr, "Failed to create a new network namespace \"%s\": %s\n",
				name, strerror(errno));
			goto out_delete;
		}

		strcpy(proc_path, "/proc/self/ns/net");

That path looks familiar. We used it a few lines ago to hold onto our original namespace. But now, after unshare, it points to a new namespace.

You can see this in /proc:

# ls -al /proc/self/ns/net
lrwxrwxrwx    1 root     root             0 Dec  5 18:40 /proc/self/ns/net -> net:[4026532206]
# ip netns exec foo ls -al /proc/self/ns/net
lrwxrwxrwx    1 root     root             0 Dec  5 18:40 /proc/self/ns/net -> net:[4026532351]

This output was taken after the full ip netns add foo invocation, but it maps onto where we are in the code: 4026532206 is the namespace we are holding on to in saved_netns, and 4026532351 is the fresh namespace that, at this point in the code, has not been given a name yet. If ip exited now, it would go away. It turns out that giving it a name solves that problem too.
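
Those numbers can also be compared from C without parsing the symlink text, because stat follows the link to the namespace object and reports the same number as its inode. A minimal sketch, with same_netns as a hypothetical helper name:

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: return 1 if both paths refer to the same
 * namespace, 0 if they differ, -1 if either path cannot be examined. */
int same_netns(const char *path_a, const char *path_b)
{
	struct stat a, b;

	/* stat() follows the /proc symlink to the namespace object itself;
	 * st_ino is the number shown inside net:[...]. */
	if (stat(path_a, &a) < 0 || stat(path_b, &b) < 0)
		return -1;
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

For example, same_netns("/proc/self/ns/net", "/var/run/netns/foo") tells a process whether it is currently running inside foo.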

	} else {
		snprintf(proc_path, sizeof(proc_path), "/proc/%d/ns/net", pid);
	}

This is for the attach case, where we use ip netns to give a name to a namespace already existing because at least one process is in it.

	/* Bind the netns last so I can watch for it */
	if (mount(proc_path, netns_path, "none", MS_BIND, NULL) < 0) {

Another magical mount invocation. With arguments filled in:

mount("/proc/self/ns/net", "/var/run/netns/foo", "none", MS_BIND, NULL)

It’s another bind mount. We take the handle to our new namespace in /proc, and mount it onto the empty file we created earlier. Now if ip exits, the namespace does not disappear, because this mount is holding on to it for us!

		fprintf(stderr, "Bind %s -> %s failed: %s\n",
			proc_path, netns_path, strerror(errno));
		goto out_delete;
	}
	netns_restore();

	return 0;
out_delete:
	if (create) {
		netns_restore();
		netns_delete(argc, argv);
	} else if (unlink(netns_path) < 0) {
		fprintf(stderr, "Cannot remove namespace file \"%s\": %s\n",
			netns_path, strerror(errno));
	}
	return -1;
}

And there we have it. One fresh namespace, freshly named. You can see it for yourself (this is on OpenWrt, where /var is a symlink to /tmp, hence the /tmp/run/netns paths):

# mount
tmpfs on /tmp/run/netns type tmpfs (rw,nosuid,nodev,noatime)
nsfs on /tmp/run/netns/foo type nsfs (rw)

Further reading (some of which has been very helpful to me while writing this post):

https://github.com/shuveb/containers-the-hard-way (which links to https://unixism.net/2020/06/containers-the-hard-way-gocker-a-mini-docker-written-in-go/)
https://pkg.go.dev/github.com/vishvananda/netns#section-readme - note how it saves, makes, restores, just like ip does.
https://helda.helsinki.fi/bitstream/handle/10138/320475/Viding_Jasu_DemystifyingContainerNetworking_2020.pdf?sequence=2&isAllowed=y

Update, December 11th 2021: after I posted this article, Dr. Jens Harbott sent me:

really nice writeup, I have a bonus question though: can you also explain what the “(id: 0)” etc. means for some netns?

I had not spotted any such thing during my tinkering, but together we figured it out:

# ip netns add foo
# ip netns set foo 15
# ip netns list
foo (id: 15)

So, it turns out Linux does somewhat have a concept of naming network namespaces - by number. Autoassignment (starting from 0) is possible too. (Reading and assigning these IDs is the only code in ipnetns.c that uses Netlink!)