Control CPU Management Policies on the Node

FEATURE STATE: Kubernetes v1.26 [stable]

Kubernetes keeps many aspects of how pods execute on nodes abstracted from the user. This is by design.  However, some workloads require stronger guarantees in terms of latency and/or performance in order to operate acceptably. The kubelet provides methods to enable more complex workload placement policies while keeping the abstraction free from explicit placement directives.

For detailed information on resource management, please refer to the Resource Management for Pods and Containers documentation.

For detailed information on how the kubelet implements resource management, please refer to the Node ResourceManagers documentation.

Before you begin

You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:

Your Kubernetes server must be at or later than version v1.26. To check the version, enter kubectl version.

If you are running an older version of Kubernetes, please look at the documentation for the version you are actually running.

Configuring CPU management policies

By default, the kubelet uses CFS quota to enforce pod CPU limits.  When the node runs many CPU-bound pods, the workload can move to different CPU cores depending on whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and thus work fine without any intervention.

However, in workloads where CPU cache affinity and scheduling latency significantly affect workload performance, the kubelet allows alternative CPU management policies to determine some placement preferences on the node.

Windows Support

FEATURE STATE: Kubernetes v1.32 [alpha] (enabled by default: false)

CPU Manager support can be enabled on Windows by using the WindowsCPUAndMemoryAffinity feature gate and it requires support in the container runtime. Once the feature gate is enabled, follow the steps below to configure the CPU manager policy.

Configuration

The CPU Manager policy is set with the --cpu-manager-policy kubelet flag or the cpuManagerPolicy field in KubeletConfiguration. There are two supported policies:

  • none: the default policy.
  • static: allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node.

The CPU manager periodically writes resource updates through the CRI in order to reconcile in-memory CPU assignments with cgroupfs. The reconcile frequency is set through a new Kubelet configuration value --cpu-manager-reconcile-period. If not specified, it defaults to the same duration as --node-status-update-frequency.

The behavior of the static policy can be fine-tuned using the --cpu-manager-policy-options flag. The flag takes a comma-separated list of key=value policy options. If you disable the CPUManagerPolicyOptions feature gate then you cannot fine-tune CPU manager policies. In that case, the CPU manager operates only using its default settings.

In addition to the top-level CPUManagerPolicyOptions feature gate, the policy options are split into two groups: alpha quality (hidden by default) and beta quality (visible by default). The groups are guarded respectively by the CPUManagerPolicyAlphaOptions and CPUManagerPolicyBetaOptions feature gates. Diverging from the Kubernetes standard, these feature gates guard groups of options, because it would have been too cumbersome to add a feature gate for each individual option.

Changing the CPU Manager Policy

Since the CPU manager policy can only be applied when kubelet spawns new pods, simply changing from "none" to "static" won't apply to existing pods. So in order to properly change the CPU manager policy on a node, perform the following steps:

  1. Drain the node.
  2. Stop kubelet.
  3. Remove the old CPU manager state file. The path to this file is /var/lib/kubelet/cpu_manager_state by default. This clears the state maintained by the CPUManager so that the cpu-sets set up by the new policy won’t conflict with it.
  4. Edit the kubelet configuration to change the CPU manager policy to the desired value.
  5. Start kubelet.

Repeat this process for every node that needs its CPU manager policy changed. Skipping this process will result in kubelet crashlooping with the following error:

could not restore state from checkpoint: configured policy "static" differs from state checkpoint policy "none", please drain this node and delete the CPU manager checkpoint file "/var/lib/kubelet/cpu_manager_state" before restarting Kubelet

none policy configuration

This policy has no extra configuration items.

static policy configuration

This policy manages a shared pool of CPUs that initially contains all CPUs in the node. The amount of exclusively allocatable CPUs is equal to the total number of CPUs in the node minus any CPU reservations by the kubelet --kube-reserved or --system-reserved options. From 1.17, the CPU reservation list can be specified explicitly by kubelet --reserved-cpus option. The explicit CPU list specified by --reserved-cpus takes precedence over the CPU reservation specified by --kube-reserved and --system-reserved. CPUs reserved by these options are taken, in integer quantity, from the initial shared pool in ascending order by physical core ID.  This shared pool is the set of CPUs on which any containers in BestEffort and Burstable pods run. Containers in Guaranteed pods with fractional CPU requests also run on CPUs in the shared pool. Only containers that are both part of a Guaranteed pod and have integer CPU requests are assigned exclusive CPUs.

Static policy options

You can toggle groups of options on and off based upon their maturity level using the following feature gates:

  • CPUManagerPolicyBetaOptions default enabled. Disable to hide beta-level options.
  • CPUManagerPolicyAlphaOptions default disabled. Enable to show alpha-level options. You will still have to enable each option using the CPUManagerPolicyOptions kubelet option.

The following policy options exist for the static CPUManager policy:

  • full-pcpus-only (beta, visible by default) (1.22 or higher)
  • distribute-cpus-across-numa (alpha, hidden by default) (1.23 or higher)
  • align-by-socket (alpha, hidden by default) (1.25 or higher)
  • distribute-cpus-across-cores (alpha, hidden by default) (1.31 or higher)
  • strict-cpu-reservation (alpha, hidden by default) (1.32 or higher)
  • prefer-align-cpus-by-uncorecache (alpha, hidden by default) (1.32 or higher)

The full-pcpus-only option can be enabled by adding full-pcpus-only=true to the CPUManager policy options. Likewise, the distribute-cpus-across-numa option can be enabled by adding distribute-cpus-across-numa=true to the CPUManager policy options. When both are set, they are "additive" in the sense that CPUs will be distributed across NUMA nodes in chunks of full-pcpus rather than individual cores. The align-by-socket policy option can be enabled by adding align-by-socket=true to the CPUManager policy options. It is also additive to the full-pcpus-only and distribute-cpus-across-numa policy options.

The distribute-cpus-across-cores option can be enabled by adding distribute-cpus-across-cores=true to the CPUManager policy options. It cannot be used with full-pcpus-only or distribute-cpus-across-numa policy options together at this moment.

The strict-cpu-reservation option can be enabled by adding strict-cpu-reservation=true to the CPUManager policy options followed by removing the /var/lib/kubelet/cpu_manager_state file and restart kubelet.

The prefer-align-cpus-by-uncorecache option can be enabled by adding the prefer-align-cpus-by-uncorecache to the CPUManager policy options. If incompatible options are used, the kubelet will fail to start with the error explained in the logs.

For mode detail about the behavior of the individual options you can configure, please refer to the Node ResourceManagers documentation.

Last modified December 15, 2024 at 6:24 PM PST: Merge pull request #49087 from Arhell/es-link (2c4497f)