Running Kubernetes Node Components as a Non-root User

FEATURE STATE: Kubernetes v1.22 [alpha]

This document describes how to run Kubernetes Node components such as kubelet, CRI, OCI, and CNI without root privileges, by using a user namespace.

This technique is also known as rootless mode.

Note:

This document describes how to run Kubernetes Node components (and hence pods) as a non-root user.

If you are just looking for how to run a pod as a non-root user, see SecurityContext.

Before you begin

Your Kubernetes server must be at or later than version 1.22. To check the version, enter kubectl version.

Enable Cgroup v2
Enable systemd with user session
Configure several sysctl values, depending on host Linux distribution
Ensure that your unprivileged user is listed in /etc/subuid and /etc/subgid
Enable the KubeletInUserNamespace feature gate

Running Kubernetes inside Rootless Docker/Podman

kind

kind supports running Kubernetes inside Rootless Docker or Rootless Podman.

See Running kind with Rootless Docker.

minikube

minikube also supports running Kubernetes inside Rootless Docker or Rootless Podman.

See the Minikube documentation:

Running Kubernetes inside Unprivileged Containers

Note: This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the content guide before submitting a change. More information.

sysbox

Sysbox is an open-source container runtime (similar to "runc") that supports running system-level workloads such as Docker and Kubernetes inside unprivileged containers isolated with the Linux user namespace.

See Sysbox Quick Start Guide: Kubernetes-in-Docker for more info.

Sysbox supports running Kubernetes inside unprivileged containers without requiring Cgroup v2 and without the KubeletInUserNamespace feature gate. It does this by exposing specially crafted /proc and /sys filesystems inside the container plus several other advanced OS virtualization techniques.

Running Rootless Kubernetes directly on a host

K3s

K3s experimentally supports rootless mode.

See Running K3s with Rootless mode for the usage.

Usernetes

Usernetes is a reference distribution of Kubernetes that can be installed under $HOME directory without the root privilege.

Usernetes supports both containerd and CRI-O as CRI runtimes. Usernetes supports multi-node clusters using Flannel (VXLAN).

See the Usernetes repo for the usage.

Manually deploy a node that runs the kubelet in a user namespace

This section provides hints for running Kubernetes in a user namespace manually.

Note:

This section is intended to be read by developers of Kubernetes distributions, not by end users.

Creating a user namespace

The first step is to create a user namespace.

If you are trying to run Kubernetes in a user-namespaced container such as Rootless Docker/Podman or LXC/LXD, you are all set, and you can go to the next subsection.

Otherwise you have to create a user namespace by yourself, by calling unshare(2) with CLONE_NEWUSER.

A user namespace can be also unshared by using command line tools such as:

After unsharing the user namespace, you will also have to unshare other namespaces such as mount namespace.

You do not need to call chroot() nor pivot_root() after unsharing the mount namespace, however, you have to mount writable filesystems on several directories in the namespace.

At least, the following directories need to be writable in the namespace (not outside the namespace):

/etc
/run
/var/logs
/var/lib/kubelet
/var/lib/cni
/var/lib/containerd (for containerd)
/var/lib/containers (for CRI-O)

Creating a delegated cgroup tree

In addition to the user namespace, you also need to have a writable cgroup tree with cgroup v2.

Note:

Kubernetes support for running Node components in user namespaces requires cgroup v2. Cgroup v1 is not supported.

If you are trying to run Kubernetes in Rootless Docker/Podman or LXC/LXD on a systemd-based host, you are all set.

Otherwise you have to create a systemd unit with Delegate=yes property to delegate a cgroup tree with writable permission.

On your node, systemd must already be configured to allow delegation; for more details, see cgroup v2 in the Rootless Containers documentation.

Configuring network

The network namespace of the Node components has to have a non-loopback interface, which can be for example configured with slirp4netns, VPNKit, or lxc-user-nic(1).

The network namespaces of the Pods can be configured with regular CNI plugins. For multi-node networking, Flannel (VXLAN, 8472/UDP) is known to work.

Ports such as the kubelet port (10250/TCP) and NodePort service ports have to be exposed from the Node network namespace to the host with an external port forwarder, such as RootlessKit, slirp4netns, or socat(1).

You can use the port forwarder from K3s. See Running K3s in Rootless Mode for more details. The implementation can be found in the pkg/rootlessports package of k3s.

Configuring CRI

The kubelet relies on a container runtime. You should deploy a container runtime such as containerd or CRI-O and ensure that it is running within the user namespace before the kubelet starts.

containerd
CRI-O

Running CRI plugin of containerd in a user namespace is supported since containerd 1.4.

Running containerd within a user namespace requires the following configurations.

version = 2

[plugins."io.containerd.grpc.v1.cri"]
# Disable AppArmor
  disable_apparmor = true
# Ignore an error during setting oom_score_adj
  restrict_oom_score_adj = true
# Disable hugetlb cgroup v2 controller (because systemd does not support delegating hugetlb controller)
  disable_hugetlb_controller = true

[plugins."io.containerd.grpc.v1.cri".containerd]
# Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
  snapshotter = "fuse-overlayfs"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
# We use cgroupfs that is delegated by systemd, so we do not use SystemdCgroup driver
# (unless you run another systemd in the namespace)
  SystemdCgroup = false

The default path of the configuration file is /etc/containerd/config.toml. The path can be specified with containerd -c /path/to/containerd/config.toml.

Running CRI-O in a user namespace is supported since CRI-O 1.22.

CRI-O requires an environment variable _CRIO_ROOTLESS=1 to be set.

The following configurations are also recommended:

[crio]
  storage_driver = "overlay"
# Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
  storage_option = ["overlay.mount_program=/usr/local/bin/fuse-overlayfs"]

[crio.runtime]
# We use cgroupfs that is delegated by systemd, so we do not use "systemd" driver
# (unless you run another systemd in the namespace)
  cgroup_manager = "cgroupfs"

The default path of the configuration file is /etc/crio/crio.conf. The path can be specified with crio --config /path/to/crio/crio.conf.

Configuring kubelet

Running kubelet in a user namespace requires the following configuration:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletInUserNamespace: true
# We use cgroupfs that is delegated by systemd, so we do not use "systemd" driver
# (unless you run another systemd in the namespace)
cgroupDriver: "cgroupfs"

When the KubeletInUserNamespace feature gate is enabled, the kubelet ignores errors that may happen during setting the following sysctl values on the node.

vm.overcommit_memory
vm.panic_on_oom
kernel.panic
kernel.panic_on_oops
kernel.keys.root_maxkeys
kernel.keys.root_maxbytes.

Within a user namespace, the kubelet also ignores any error raised from trying to open /dev/kmsg. This feature gate also allows kube-proxy to ignore an error during setting RLIMIT_NOFILE.

The KubeletInUserNamespace feature gate was introduced in Kubernetes v1.22 with "alpha" status.

Running kubelet in a user namespace without using this feature gate is also possible by mounting a specially crafted proc filesystem (as done by Sysbox), but not officially supported.

Configuring kube-proxy

Running kube-proxy in a user namespace requires the following configuration:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables" # or "userspace"
conntrack:
# Skip setting sysctl value "net.netfilter.nf_conntrack_max"
  maxPerCore: 0
# Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
  tcpEstablishedTimeout: 0s
# Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close"
  tcpCloseWaitTimeout: 0s

Caveats

Most of "non-local" volume drivers such as nfs and iscsi do not work. Local volumes like local, hostPath, emptyDir, configMap, secret, and downwardAPI are known to work.
Some CNI plugins may not work. Flannel (VXLAN) is known to work.

For more on this, see the Caveats and Future work page on the rootlesscontaine.rs website.

Running Kubernetes Node Components as a Non-root User

Note:

Before you begin

Running Kubernetes inside Rootless Docker/Podman

kind

minikube

Running Kubernetes inside Unprivileged Containers

sysbox

Running Rootless Kubernetes directly on a host

K3s

Usernetes

Manually deploy a node that runs the kubelet in a user namespace

Note:

Creating a user namespace

Creating a delegated cgroup tree

Note:

Configuring network

Configuring CRI

Configuring kubelet

Configuring kube-proxy

Caveats

See Also

Feedback