Kubelet Systemd Watchdog

FEATURE STATE: Kubernetes v1.32 [beta] (enabled by default: true)

On Linux nodes, Kubernetes 1.32 supports integrating with systemd to allow the operating system supervisor to recover a failed kubelet. This integration is not enabled by default. It can be used as an alternative to periodically requesting the kubelet's /healthz endpoint for health checks. If the kubelet does not respond to the watchdog within the timeout period, the watchdog will kill the kubelet.

The systemd watchdog works by requiring the service to periodically send a keep-alive signal to the systemd process. If the signal is not received within a specified timeout period, the service is considered unresponsive and is terminated. The service can then be restarted according to the configuration.

Configuration

Using the systemd watchdog requires configuring the WatchdogSec parameter in the [Service] section of the kubelet service unit file:

[Service]
WatchdogSec=30s

Setting WatchdogSec=30s indicates a service watchdog timeout of 30 seconds. Within the kubelet, the sd_notify() function is invoked, at intervals of WatchdogSec ÷ 2, to send WATCHDOG=1 (a keep-alive message). If the watchdog is not fed within the timeout period, the kubelet will be killed. Setting Restart to "always", "on-failure", "on-watchdog", or "on-abnormal" will ensure that the service is automatically restarted.

Some details about the systemd configuration:

  1. If you set the systemd value for WatchdogSec to 0, or omit setting it, the systemd watchdog is not enabled for this unit.
  2. The kubelet supports a minimum watchdog period of 1.0 seconds; this is to prevent the kubelet from being killed unexpectedly. You can set the value of WatchdogSec in a systemd unit definition to a period shorter than 1 second, but Kubernetes does not support any shorter interval. The timeout does not have to be a whole integer number of seconds.
  3. The Kubernetes project suggests setting WatchdogSec to approximately a 15s period. Periods longer than 10 minutes are supported but explicitly not recommended.

Example Configuration

[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/home/
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/bin/kubelet
# Configures the watchdog timeout
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

What's next

For more details about systemd configuration, refer to the systemd documentation

Last modified December 15, 2024 at 6:24 PM PST: Merge pull request #49087 from Arhell/es-link (2c4497f)