 
Node Reference Information
    
      This section contains the following reference topics about nodes:
You can also read node reference details from elsewhere in the
Kubernetes documentation, including:
 
 
  
  
  
  
  
  
  
    
    
	
    
    
	1 - Kubelet Checkpoint API
    
	
  
FEATURE STATE: Kubernetes v1.30 [beta] (enabled by default: true)
            
Checkpointing a container is the functionality to create a stateful copy of a
running container. Once you have a stateful copy of a container, you could
move it to a different computer for debugging or similar purposes.
If you move the checkpointed container data to a computer that's able to restore
it, that restored container continues to run at exactly the same
point it was checkpointed. You can also inspect the saved data, provided that you
have suitable tools for doing so.
Creating a checkpoint of a container might have security implications. Typically
a checkpoint contains all memory pages of all processes in the checkpointed
container. This means that everything that used to be in memory is now available
on the local disk. This includes all private data and possibly keys used for
encryption. The underlying CRI implementations (the container runtime on that node)
should create the checkpoint archive to be only accessible by the root user. It
is still important to remember that, if the checkpoint archive is transferred to another
system, all memory pages will be readable by the owner of the checkpoint archive.
Operations
post checkpoint the specified container
Tell the kubelet to checkpoint a specific container from the specified Pod.
Consult the Kubelet authentication/authorization reference
for more information about how access to the kubelet checkpoint interface is
controlled.
The kubelet will request a checkpoint from the underlying
CRI implementation. In the checkpoint
request the kubelet will specify the name of the checkpoint archive as
checkpoint-<podFullName>-<containerName>-<timestamp>.tar and also request to
store the checkpoint archive in the checkpoints directory below its root
directory (as defined by --root-dir).  This defaults to
/var/lib/kubelet/checkpoints.
The checkpoint archive is in tar format, and could be listed using an implementation of
tar. The contents of the
archive depend on the underlying CRI implementation (the container runtime on that node).
HTTP Request
POST /checkpoint/{namespace}/{pod}/{container}
Parameters
- 
namespace (in path): string, required Namespace
- 
pod (in path): string, required Pod
- 
container (in path): string, required Container
- 
timeout (in query): integer Timeout in seconds to wait until the checkpoint creation is finished.
If zero or no timeout is specified the default CRI timeout value will be used. Checkpoint
creation time depends directly on the used memory of the container.
The more memory a container uses the more time is required to create
the corresponding checkpoint. 
Response
200: OK
401: Unauthorized
404: Not Found (if the ContainerCheckpoint feature gate is disabled)
404: Not Found (if the specified namespace, pod or container cannot be found)
500: Internal Server Error (if the CRI implementation encounters an error during checkpointing (see error message for further details))
500: Internal Server Error (if the CRI implementation does not implement the checkpoint CRI API (see error message for further details))
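As a minimal sketch of how a client might assemble the request described above, the following Python helper builds the URL for the checkpoint endpoint. The function name `checkpoint_url`, the node hostname, and the use of port 10250 (commonly the kubelet's secure port) are assumptions for illustration; the helper only builds a string and does not send a request or handle authentication.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def checkpoint_url(node: str, namespace: str, pod: str, container: str,
                   timeout: Optional[int] = None, port: int = 10250) -> str:
    """Build the URL for POST /checkpoint/{namespace}/{pod}/{container}.

    `node` and `port` are illustrative assumptions; adjust them for your cluster.
    """
    # Percent-encode each path segment so unusual names cannot break the path.
    path = "/checkpoint/" + "/".join(
        quote(part, safe="") for part in (namespace, pod, container)
    )
    url = f"https://{node}:{port}{path}"
    if timeout:
        # A zero or missing timeout means the default CRI timeout is used,
        # so the query parameter is only added for positive values.
        url += "?" + urlencode({"timeout": timeout})
    return url


print(checkpoint_url("node-1.example", "default", "web-0", "nginx", timeout=60))
```

The request itself must be sent as an HTTP POST by a client that is authorized to use the kubelet API.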
 
    
	
  
    
    
	
    
    
	2 - Linux Kernel Version Requirements
    
	Note: This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a project to this list, read the 
content guide before submitting a change. 
More information. 
Many features rely on specific kernel functionalities and have minimum kernel version requirements.
However, relying solely on kernel version numbers may not be sufficient
for certain operating system distributions,
as maintainers for distributions such as RHEL, Ubuntu and SUSE often backport selected features
to older kernel releases (retaining the older kernel version).
Pod sysctls
On Linux, the sysctl() system call configures kernel parameters at run time. There is a command
line tool named sysctl that you can use to configure these parameters, and many are exposed via
the proc filesystem.
Some sysctls are only available if you have a modern enough kernel.
The following sysctls have a minimal kernel version requirement,
and are supported in the safe set:
- net.ipv4.ip_local_reserved_ports (since Kubernetes 1.27, needs kernel 3.16+);
- net.ipv4.tcp_keepalive_time (since Kubernetes 1.29, needs kernel 4.5+);
- net.ipv4.tcp_fin_timeout (since Kubernetes 1.29, needs kernel 4.6+);
- net.ipv4.tcp_keepalive_intvl (since Kubernetes 1.29, needs kernel 4.5+);
- net.ipv4.tcp_keepalive_probes (since Kubernetes 1.29, needs kernel 4.5+);
- net.ipv4.tcp_syncookies (namespaced since kernel 4.6+);
- net.ipv4.tcp_rmem (since Kubernetes 1.32, needs kernel 4.15+);
- net.ipv4.tcp_wmem (since Kubernetes 1.32, needs kernel 4.15+);
- net.ipv4.vs.conn_reuse_mode (used in ipvs proxy mode, needs kernel 4.1+).
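The version requirements in the list above can be checked with a short script. This is only a sketch: the helper names are hypothetical, and, as noted earlier, comparing version numbers may give false negatives on distributions that backport features to older kernels.

```python
import re
from typing import List, Tuple

# Minimum kernel versions, copied from the list above.
SYSCTL_MIN_KERNEL = {
    "net.ipv4.ip_local_reserved_ports": (3, 16),
    "net.ipv4.tcp_keepalive_time": (4, 5),
    "net.ipv4.tcp_fin_timeout": (4, 6),
    "net.ipv4.tcp_keepalive_intvl": (4, 5),
    "net.ipv4.tcp_keepalive_probes": (4, 5),
    "net.ipv4.tcp_syncookies": (4, 6),
    "net.ipv4.tcp_rmem": (4, 15),
    "net.ipv4.tcp_wmem": (4, 15),
    "net.ipv4.vs.conn_reuse_mode": (4, 1),
}


def parse_kernel_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def sysctls_below_minimum(release: str) -> List[str]:
    """Return the sysctls whose documented minimum is newer than `release`."""
    running = parse_kernel_release(release)
    return sorted(
        name for name, minimum in SYSCTL_MIN_KERNEL.items() if running < minimum
    )


print(sysctls_below_minimum("4.5.0"))
```

On a live node, the release string could come from `platform.release()` in the Python standard library.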
kube proxy nftables proxy mode
For Kubernetes 1.35, the
nftables mode of kube-proxy requires
version 1.0.1 or later
of the nft command-line tool, as well as kernel 5.13 or later.
For testing/development purposes, you can use older kernels, as far back as 5.4, if you set the
nftables.skipKernelVersionCheck option in the kube-proxy config.
However, this is not recommended in production, since it may cause problems with other nftables
users on the system.
Version 2 control groups
Kubernetes cgroup v1 support is in maintenance mode starting from Kubernetes v1.31; using cgroup v2
is recommended.
In Linux 5.8, the system-level cpu.stat file was added to the root cgroup for convenience.
According to the runc documentation, kernels older than 5.2 are not recommended because they lack the freezer.
Pressure Stall Information (PSI)
Pressure Stall Information is supported in Linux kernel versions 4.20 and up, but requires the following configuration:
- The kernel must be compiled with the CONFIG_PSI=y option. Most modern distributions enable this by default. You can check your kernel's configuration by running zgrep CONFIG_PSI /proc/config.gz.
- Some Linux distributions may compile PSI into the kernel but disable it by default. If so, you need to enable it at boot time by adding the psi=1 parameter to the kernel command line.
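The two conditions above can be combined into a small check. The following is a sketch under stated assumptions: the function name `psi_status` is hypothetical, the inputs are the decompressed kernel configuration text and the contents of the kernel command line, and the `CONFIG_PSI_DEFAULT_DISABLED` option and `psi=0` parameter come from the Linux kernel rather than from this page.

```python
def psi_status(kernel_config: str, cmdline: str) -> str:
    """Classify PSI availability from kernel config text and the boot command line."""
    config_lines = {line.strip() for line in kernel_config.splitlines()}
    if "CONFIG_PSI=y" not in config_lines:
        return "not compiled in"
    args = cmdline.split()
    if "psi=0" in args:
        # Assumption: psi=0 on the command line switches PSI off explicitly.
        return "disabled at boot"
    if "CONFIG_PSI_DEFAULT_DISABLED=y" in config_lines and "psi=1" not in args:
        return "compiled in but disabled by default"
    return "enabled"


sample_config = "CONFIG_PSI=y\nCONFIG_PSI_DEFAULT_DISABLED=y\n"
print(psi_status(sample_config, "ro quiet"))
print(psi_status(sample_config, "ro quiet psi=1"))
```

On a real node the command line is typically available at /proc/cmdline; not every distribution exposes /proc/config.gz.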
Other kernel requirements
Some features may depend on new kernel functionalities and have specific kernel requirements:
- Recursive read only mount:
This is implemented by applying the MOUNT_ATTR_RDONLY attribute with the AT_RECURSIVE flag
using mount_setattr(2), added in Linux kernel v5.12.
- Pod user namespace support requires minimal kernel version 6.5+, according to
KEP-127.
- For node system swap, tmpfs set to noswap is not supported until kernel 6.3.
Linux kernel long term maintenance
Active kernel releases can be found on kernel.org.
There are usually several long term maintenance kernel releases provided for the purposes of backporting
bug fixes for older kernel trees. Only important bug fixes are applied to such kernels and they don't
usually see very frequent releases, especially for older trees.
See the Linux kernel website for the list of releases
in the Longterm category.
What's next
 
    
	
  
    
    
	
    
    
	3 - Articles on dockershim Removal and on Using CRI-compatible Runtimes
    
	
This is a list of articles and other pages that are either
about Kubernetes' deprecation and removal of dockershim,
or about using CRI-compatible container runtimes,
in connection with that removal.
Kubernetes project
You can provide feedback via the GitHub issue Dockershim removal feedback & issues. (k/kubernetes/#106917)
External sources
 
    
	
  
    
    
	
    
    
	4 - Node Labels Populated By The Kubelet
    
	Kubernetes nodes come pre-populated
with a standard set of labels.
You can also set your own labels on nodes, either through the kubelet configuration or
using the Kubernetes API.
Preset labels
The preset labels that Kubernetes sets on nodes are:
Note:
The values of these labels are cloud-provider specific and are not guaranteed to be reliable.
For example, the value of kubernetes.io/hostname may be the same as the node name in some environments
and a different value in other environments.
What's next
 
    
	
  
    
    
	
    
    
	5 - Local Files And Paths Used By The Kubelet
    
	The kubelet is mostly a stateless
process running on a Kubernetes node.
This document outlines files that the kubelet reads and writes.
Note:
This document is for informational purposes only and does not describe any guaranteed behaviors or APIs.
It lists resources used by the kubelet, which are implementation details and subject to change in any release.
The kubelet typically uses the control plane as
the source of truth on what needs to run on the Node, and the
container runtime to retrieve
the current state of containers. So long as you provide a kubeconfig (API client configuration)
to the kubelet, the kubelet connects to your control plane; otherwise the node operates in
standalone mode.
On Linux nodes, the kubelet also relies on reading cgroups and various system files to collect metrics.
On Windows nodes, the kubelet collects metrics via a different mechanism that does not rely on
paths.
The kubelet also uses a few other files, because it communicates
over local Unix-domain sockets. The kubelet listens on some of these sockets;
others it discovers and then connects to
as a client.
Note:
This page lists paths as Linux paths, which map to the Windows paths by adding a root disk
C:\ in place of / (unless specified otherwise).
For example, /var/lib/kubelet/device-plugins maps to C:\var\lib\kubelet\device-plugins.
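The mapping rule in the note above can be sketched with Python's `pathlib`. The helper name `to_windows_path` is hypothetical, and the sketch covers only the default rule (replace the leading / with C:\), not the exceptions that this page calls out separately.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_windows_path(linux_path: str, drive: str = "C:\\") -> str:
    """Map an absolute Linux path to its Windows form by swapping / for the drive root."""
    parts = PurePosixPath(linux_path).parts
    if not parts or parts[0] != "/":
        raise ValueError("expected an absolute Linux path")
    # Drop the leading "/" and re-root the remaining components on the drive.
    return str(PureWindowsPath(drive, *parts[1:]))


print(to_windows_path("/var/lib/kubelet/device-plugins"))
```

The pure path classes only manipulate strings, so the sketch behaves the same on any operating system.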
Configuration
Kubelet configuration files
The path to the kubelet configuration file can be configured
using the command line argument --config. The kubelet also supports
drop-in configuration files
to enhance configuration.
Certificates
Certificates and private keys are typically located at /var/lib/kubelet/pki,
but can be configured using the --cert-dir kubelet command line argument.
Names of certificate files are also configurable.
Manifests
Manifests for static pods are typically located in /etc/kubernetes/manifests.
Location can be configured using the staticPodPath kubelet configuration option.
Systemd unit settings
When the kubelet is running as a systemd unit, some kubelet configuration may be declared
in a systemd unit settings file. Typically it includes:
State
Checkpoint files for resource managers
All resource managers keep the mapping of Pods to allocated resources in state files.
State files are located in the kubelet's base directory, also termed the root directory
(but not the same as /, the node root directory). You can configure the base directory
for the kubelet
using the kubelet command line argument --root-dir.
Names of files:
Checkpoint file for device manager
The device manager creates checkpoints in the same directory as the socket files: /var/lib/kubelet/device-plugins/.
The name of the device manager's checkpoint file is
kubelet_internal_checkpoint.
Pod resource checkpoints
  
FEATURE STATE: Kubernetes v1.33 [beta] (enabled by default: true)
            
If a node has enabled the InPlacePodVerticalScaling feature gate,
the kubelet stores a local record of allocated and actuated Pod resources.
See Resize CPU and Memory Resources assigned to Containers
for more details on how these records are used.
Names of files:
- allocated_pods_state records the resources allocated to each pod running on the node
- actuated_pods_state records the resources that have been accepted by the runtime
for each pod running on the node
The files are located within the kubelet base directory
(/var/lib/kubelet by default on Linux; configurable using --root-dir).
Container runtime
The kubelet communicates with the container runtime using sockets configured via the
following configuration parameters:
- containerRuntimeEndpoint for runtime operations
- imageServiceEndpoint for image management operations
The actual values of those endpoints depend on the container runtime being used.
Device plugins
The kubelet exposes a socket at the path /var/lib/kubelet/device-plugins/kubelet.sock for
various Device Plugins to register.
When a device plugin registers itself, it provides its socket path for the kubelet to connect.
The device plugin socket should be in the directory device-plugins within the kubelet base
directory. On a typical Linux node, this means /var/lib/kubelet/device-plugins.
Pod resources API
The Pod Resources API
is exposed at the path /var/lib/kubelet/pod-resources.
DRA, CSI, and Device plugins
The kubelet looks for socket files created by device plugins managed via DRA,
device manager, or storage plugins, and then attempts to connect
to these sockets. The directory that the kubelet looks in is plugins_registry within the kubelet base
directory, so on a typical Linux node this means /var/lib/kubelet/plugins_registry.
Note that for device plugins there are two alternative registration mechanisms.
Only one should be used for a given plugin.
The types of plugins that can place socket files into that directory are:
- CSI plugins
- DRA plugins
- Device Manager plugins
(typically /var/lib/kubelet/plugins_registry).
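Several of the locations on this page are described as living inside the kubelet base directory. As a rough sketch, assuming that each of them simply follows the value of --root-dir, the paths can be derived like this (the function name `kubelet_paths` and the dictionary keys are hypothetical):

```python
from pathlib import PurePosixPath
from typing import Dict


def kubelet_paths(root_dir: str = "/var/lib/kubelet") -> Dict[str, str]:
    """Derive Linux paths that this reference places under the kubelet base directory."""
    root = PurePosixPath(root_dir)
    return {
        # Container checkpoint archives (Kubelet Checkpoint API).
        "checkpoints": str(root / "checkpoints"),
        # Pod resource checkpoint files.
        "allocated_pods_state": str(root / "allocated_pods_state"),
        "actuated_pods_state": str(root / "actuated_pods_state"),
        # Directory where device plugin sockets are expected.
        "device_plugins": str(root / "device-plugins"),
        # Directory scanned for CSI, DRA, and device manager plugin sockets.
        "plugins_registry": str(root / "plugins_registry"),
    }


for name, path in kubelet_paths().items():
    print(f"{name}: {path}")
```

Because these are implementation details, treat the derived paths as a starting point for inspection rather than a stable interface.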
Graceful node shutdown
  
FEATURE STATE: Kubernetes v1.21 [beta] (enabled by default: true)
            
Graceful node shutdown
stores state locally at /var/lib/kubelet/graceful_node_shutdown_state.
Image Pull Records
  
FEATURE STATE: Kubernetes v1.33 [alpha] (enabled by default: false)
            
The kubelet stores records of attempted and successful image pulls, and uses them
to verify that the image was previously successfully pulled with the same credentials.
These records are cached as files in the image_manager directory within
the kubelet base directory. On a typical Linux node, this means /var/lib/kubelet/image_manager.
There are two subdirectories of image_manager:
- pulling: stores records about images the kubelet is attempting to pull.
- pulled: stores records about images that were successfully pulled by the kubelet,
along with metadata about the credentials used for the pulls.
See Ensure Image Pull Credential Verification
for details.
Security profiles & configuration
Seccomp
Seccomp profile files referenced from Pods should be placed in /var/lib/kubelet/seccomp.
See the seccomp reference for details.
AppArmor
The kubelet does not load or refer to AppArmor profiles by a Kubernetes-specific path.
AppArmor profiles are loaded via the node operating system rather than referenced by their path.
Locking
  
FEATURE STATE: Kubernetes v1.2 [alpha]
    
  
The kubelet uses a lock file, typically /var/run/kubelet.lock, to ensure
that two different kubelets don't try to run in conflict with each other.
You can configure the path to the lock file using the --lock-file kubelet command line argument.
If two kubelets on the same node use different values for the lock file path, they will not be able to
detect a conflict when both are running.
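To illustrate why both kubelets must agree on the lock file path, here is a sketch of a lock file protected by an advisory `flock`. That the kubelet's lock works exactly this way is an assumption made for illustration; the helper `try_lock` and the temporary file names are hypothetical, and the code runs only on Unix-like systems.

```python
import fcntl
import os
import tempfile


def try_lock(path: str):
    """Return an open file holding an exclusive lock, or None if the lock is taken."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


with tempfile.TemporaryDirectory() as tmp:
    shared = os.path.join(tmp, "kubelet.lock")
    other = os.path.join(tmp, "other.lock")
    first = try_lock(shared)   # first holder acquires the lock
    second = try_lock(shared)  # same path: the conflict is detected
    third = try_lock(other)    # different path: no conflict is detected
    results = (first is not None, second is None, third is not None)
    for handle in (first, third):
        if handle is not None:
            handle.close()

print(results)
```

The third call succeeds even though a "kubelet" is already running, which mirrors the undetected conflict described above.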
What's next
 
    
	
  
    
    
	
    
    
	6 - Kubelet Configuration Directory Merging
    
	When using the kubelet's --config-dir flag to specify a drop-in directory for
configuration, there is some specific behavior on how different types are
merged.
Here are some examples of how different data types behave during configuration merging:
Structure Fields
There are two types of structure fields in a YAML structure: singular (or a
scalar type) and embedded (structures that contain scalar types).
The configuration merging process handles the overriding of singular and embedded struct fields to create a resulting kubelet configuration.
For instance, you may want a baseline kubelet configuration for all nodes, but you may want to customize the address and authorization fields.
This can be done as follows:
Main kubelet configuration file contents:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: "5m"
    cacheUnauthorizedTTL: "30s"
serializeImagePulls: false
address: "192.168.0.1"
Contents of a file in --config-dir directory:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authorization:
  mode: AlwaysAllow
  webhook:
    cacheAuthorizedTTL: "8m"
    cacheUnauthorizedTTL: "45s"
address: "192.168.0.8"
The resulting configuration will be as follows:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
authorization:
  mode: AlwaysAllow
  webhook:
    cacheAuthorizedTTL: "8m"
    cacheUnauthorizedTTL: "45s"
address: "192.168.0.8"
Lists
You can override slice (list) values in the kubelet configuration.
However, the entire list gets overridden during the merging process.
For example, you can override the clusterDNS list as follows:
Main kubelet configuration file contents:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
clusterDNS:
  - "192.168.0.9"
  - "192.168.0.8"
Contents of a file in --config-dir directory:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
  - "192.168.0.2"
  - "192.168.0.3"
  - "192.168.0.5"
The resulting configuration will be as follows:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
clusterDNS:
  - "192.168.0.2"
  - "192.168.0.3"
  - "192.168.0.5"
Maps, including Nested Structures
Individual fields in maps, regardless of their value types (boolean, string, etc.), can be selectively overridden.
However, for map[string][]string, the entire list associated with a specific field gets overridden.
Let's understand this better with an example, particularly on fields like featureGates and staticPodURLHeader:
Main kubelet configuration file contents:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
featureGates:
  AllAlpha: false
  MemoryQoS: true
staticPodURLHeader:
  kubelet-api-support:
  - "Authorization: 234APSDFA"
  - "X-Custom-Header: 123"
  custom-static-pod:
  - "Authorization: 223EWRWER"
  - "X-Custom-Header: 456"
Contents of a file in --config-dir directory:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: false
  KubeletTracing: true
  DynamicResourceAllocation: true
staticPodURLHeader:
  custom-static-pod:
  - "Authorization: 223EWRWER"
  - "X-Custom-Header: 345"
The resulting configuration will be as follows:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
featureGates:
  AllAlpha: false
  MemoryQoS: false
  KubeletTracing: true
  DynamicResourceAllocation: true
staticPodURLHeader:
  kubelet-api-support:
  - "Authorization: 234APSDFA"
  - "X-Custom-Header: 123"
  custom-static-pod:
  - "Authorization: 223EWRWER"
  - "X-Custom-Header: 345"
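The merge behavior shown in the examples on this page can be summarized as: scalars are replaced, nested structures and maps are merged key by key, and lists are replaced wholesale. The following Python sketch models those semantics on plain dictionaries. It is not the kubelet's implementation, and the function name `merge_config` is hypothetical.

```python
from copy import deepcopy


def merge_config(base: dict, drop_in: dict) -> dict:
    """Merge a drop-in configuration over a base configuration."""
    merged = deepcopy(base)
    for key, value in drop_in.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Structures and maps: merge field by field.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists: the drop-in value replaces the base value entirely.
            merged[key] = deepcopy(value)
    return merged


base = {
    "port": 20250,
    "serializeImagePulls": False,
    "featureGates": {"AllAlpha": False, "MemoryQoS": True},
    "staticPodURLHeader": {
        "kubelet-api-support": ["Authorization: 234APSDFA", "X-Custom-Header: 123"],
        "custom-static-pod": ["Authorization: 223EWRWER", "X-Custom-Header: 456"],
    },
}
drop_in = {
    "featureGates": {
        "MemoryQoS": False,
        "KubeletTracing": True,
        "DynamicResourceAllocation": True,
    },
    "staticPodURLHeader": {
        "custom-static-pod": ["Authorization: 223EWRWER", "X-Custom-Header: 345"],
    },
}

merged = merge_config(base, drop_in)
print(merged["featureGates"])
print(merged["staticPodURLHeader"]["custom-static-pod"])
```

Running the sketch on the map example reproduces the resulting configuration listed above: AllAlpha is kept, MemoryQoS is overridden, and the custom-static-pod header list is replaced as a whole.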
 
    
	
  
    
    
	
    
    
	7 - Kubelet Device Manager API Versions
    
	This page provides details of version compatibility between the Kubernetes
device plugin API,
and different versions of Kubernetes itself.
Compatibility matrix
|  | v1alpha1 | v1beta1 | 
| Kubernetes 1.21 | - | ✓ | 
| Kubernetes 1.22 | - | ✓ | 
| Kubernetes 1.23 | - | ✓ | 
| Kubernetes 1.24 | - | ✓ | 
| Kubernetes 1.25 | - | ✓ | 
| Kubernetes 1.26 | - | ✓ | 
Key:
- ✓ Exactly the same features / API objects in both the device plugin API and
the Kubernetes version.
- + The device plugin API has features or API objects that may not be present in the
Kubernetes cluster, either because the device plugin API has added new API
calls, or because the server has removed an old API call. However, everything they have in
common (most other APIs) will work. Note that alpha APIs may vanish or
change significantly between one minor release and the next.
- - The Kubernetes cluster has features the device plugin API can't use,
either because the server has added API calls, or because the device plugin API has
removed an old API call. However, everything they share in common (most APIs) will work.
 
    
	
  
    
    
	
    
    
	8 - Kubelet Systemd Watchdog
    
	
  
FEATURE STATE: Kubernetes v1.32 [beta] (enabled by default: true)
            
On Linux nodes, Kubernetes 1.35 supports integrating with
systemd to allow the operating system supervisor to recover
a failed kubelet. This integration is not enabled by default.
It can be used as an alternative to periodically requesting
the kubelet's /healthz endpoint for health checks. If the kubelet
does not respond to the watchdog within the timeout period, the watchdog
will kill the kubelet.
The systemd watchdog works by requiring the service to periodically send
a keep-alive signal to the systemd process. If the signal is not received
within a specified timeout period, the service is considered unresponsive
and is terminated. The service can then be restarted according to the configuration.
Configuration
Using the systemd watchdog requires configuring the WatchdogSec parameter
in the [Service] section of the kubelet service unit file:
[Service]
WatchdogSec=30s
Setting WatchdogSec=30s indicates a service watchdog timeout of 30 seconds.
Within the kubelet, the sd_notify() function is invoked at intervals of half the WatchdogSec value (WatchdogSec ÷ 2) to send
WATCHDOG=1 (a keep-alive message). If the watchdog is not fed
within the timeout period, the kubelet will be killed. Setting Restart
to "always", "on-failure", "on-watchdog", or "on-abnormal" will ensure that the service
is automatically restarted.
Some details about the systemd configuration:
- If you set the systemd value for WatchdogSec to 0, or omit setting it, the systemd watchdog is not
enabled for this unit.
- The kubelet supports a minimum watchdog period of 1.0 seconds; this is to prevent the kubelet
from being killed unexpectedly. You can set the value of WatchdogSec in a systemd unit definition
to a period shorter than 1 second, but Kubernetes does not support any shorter interval.
The timeout does not have to be a whole integer number of seconds.
- The Kubernetes project suggests setting WatchdogSec to approximately a 15s period.
Periods longer than 10 minutes are supported but explicitly not recommended.
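The relationship between WatchdogSec and the keep-alive interval can be sketched as follows. The function name `keepalive_interval` is hypothetical, the input is assumed to be already converted to seconds, and raising an error for periods under one second is a choice made for this sketch rather than a description of what the kubelet does.

```python
from typing import Optional

# Minimum watchdog period that the kubelet supports, per the list above.
MIN_SUPPORTED_WATCHDOG_SEC = 1.0


def keepalive_interval(watchdog_sec: float) -> Optional[float]:
    """Return the keep-alive interval in seconds, or None if the watchdog is off."""
    if watchdog_sec == 0:
        # WatchdogSec=0 (or unset) means the systemd watchdog is not enabled.
        return None
    if watchdog_sec < MIN_SUPPORTED_WATCHDOG_SEC:
        raise ValueError("watchdog periods shorter than 1 second are not supported")
    # The keep-alive message is sent at half the watchdog timeout.
    return watchdog_sec / 2


print(keepalive_interval(30))
```

With WatchdogSec=30s this gives a keep-alive every 15 seconds; the suggested 15s period would mean a keep-alive every 7.5 seconds.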
Example Configuration
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/home/
Wants=network-online.target
After=network-online.target
[Service]
ExecStart=/usr/bin/kubelet
# Configures the watchdog timeout
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
What's next
For more details about systemd configuration, refer to the
systemd documentation.
 
    
	
  
    
    
	
    
    
	9 - Node Status
    
	
The status of a node in Kubernetes is a critical
aspect of managing a Kubernetes cluster. In this article, we'll cover the basics of
monitoring and maintaining node status to ensure a healthy and stable cluster.
Node status fields
A Node's status contains the following information:
You can use kubectl to view a Node's status and other details:
kubectl describe node <insert-node-name-here>
Each section of the output is described below.
Addresses
The usage of these fields varies depending on your cloud provider or bare metal configuration.
- HostName: The hostname as reported by the node's kernel. Can be overridden via the kubelet
--hostname-override parameter.
- ExternalIP: Typically the IP address of the node that is externally routable (available from
outside the cluster).
- InternalIP: Typically the IP address of the node that is routable only within the cluster.
Conditions
The conditions field describes the status of all Running nodes. Examples of conditions include:
 
Node conditions, and a description of when each condition applies.
| Node Condition | Description | 
| Ready | True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 50 seconds) | 
| DiskPressure | True if pressure exists on the disk size (that is, if the disk capacity is low); otherwise False | 
| MemoryPressure | True if pressure exists on the node memory (that is, if the node memory is low); otherwise False | 
| PIDPressure | True if pressure exists on the processes (that is, if there are too many processes on the node); otherwise False | 
| NetworkUnavailable | True if the network for the node is not correctly configured; otherwise False | 
Note:
If you use command-line tools to print details of a cordoned Node, the Condition includes
SchedulingDisabled. SchedulingDisabled is not a Condition in the Kubernetes API; instead,
cordoned nodes are marked Unschedulable in their spec.
In the Kubernetes API, a node's condition is represented as part of the .status
of the Node resource. For example, the following JSON structure describes a healthy node:
"conditions": [
  {
    "type": "Ready",
    "status": "True",
    "reason": "KubeletReady",
    "message": "kubelet is posting ready status",
    "lastHeartbeatTime": "2019-06-05T18:38:35Z",
    "lastTransitionTime": "2019-06-05T11:41:27Z"
  }
]
When problems occur on nodes, the Kubernetes control plane automatically creates
taints that match the conditions
affecting the node. An example of this is when the status of the Ready condition
remains Unknown or False for longer than the kube-controller-manager's NodeMonitorGracePeriod,
which defaults to 50 seconds. This will cause either a node.kubernetes.io/unreachable taint, for an Unknown status,
or a node.kubernetes.io/not-ready taint, for a False status, to be added to the Node.
These taints affect pending pods as the scheduler takes the Node's taints into consideration when
assigning a pod to a Node. Existing pods scheduled to the node may be evicted due to the application
of NoExecute taints. Pods may also have tolerations that let
them schedule to and continue running on a Node even though it has a specific taint.
See Taint Based Evictions and
Taint Nodes by Condition
for more details.
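The mapping from the Ready condition to a taint can be sketched in a few lines. The helper name `ready_taint` is hypothetical, and the sketch ignores the grace period: it reports which taint would apply once the status has persisted long enough.

```python
import json
from typing import List, Optional

# Taint keys named in the text above, indexed by the Ready condition's status.
READY_TAINTS = {
    "Unknown": "node.kubernetes.io/unreachable",
    "False": "node.kubernetes.io/not-ready",
}


def ready_taint(conditions: List[dict]) -> Optional[str]:
    """Return the taint key matching the Ready condition, or None if the node is ready."""
    for condition in conditions:
        if condition.get("type") == "Ready":
            return READY_TAINTS.get(condition.get("status"))
    return None


status = json.loads("""
{
  "conditions": [
    {"type": "Ready", "status": "Unknown", "reason": "NodeStatusUnknown"}
  ]
}
""")
print(ready_taint(status["conditions"]))
```

The input has the same shape as the conditions array under .status of a Node; the reason value in the sample is only illustrative.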
Capacity and Allocatable
Describes the resources available on the node: CPU, memory, and the maximum
number of pods that can be scheduled onto the node.
The fields in the capacity block indicate the total amount of resources that a
Node has. The allocatable block indicates the amount of resources on a
Node that is available to be consumed by normal Pods.
You may read more about capacity and allocatable resources while learning how
to reserve compute resources
on a Node.
Info
Describes general information about the node, such as kernel version, Kubernetes
version (kubelet and kube-proxy version), container runtime details, and which
operating system the node uses.
The kubelet gathers this information from the node and publishes it into
the Kubernetes API.
Heartbeats
Heartbeats, sent by Kubernetes nodes, help your cluster determine the
availability of each node, and to take action when failures are detected.
For nodes there are two forms of heartbeats:
- updates to the .status of a Node
- Lease objects
within the kube-node-lease namespace.
Each Node has an associated Lease object.
Compared to updates to .status of a Node, a Lease is a lightweight resource.
Using Leases for heartbeats reduces the performance impact of these updates
for large clusters.
The kubelet is responsible for creating and updating the .status of Nodes,
and for updating their related Leases.
- The kubelet updates the node's .status either when there is a change in status
or if there has been no update for a configured interval. The default interval
for .status updates to Nodes is 5 minutes, which is much longer than the default
timeout for unreachable nodes (node-monitor-grace-period, 50 seconds by default).
- The kubelet creates and then updates its Lease object every 10 seconds
(the default update interval). Lease updates occur independently from
updates to the Node's .status. If the Lease update fails, the kubelet retries,
using exponential backoff that starts at 200 milliseconds and is capped at 7 seconds.
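To make the retry schedule concrete, the following sketch lists the delays for successive failed Lease updates. The 200 millisecond start and 7 second cap come from the text above; the doubling factor and the function name `lease_retry_delays_ms` are assumptions, since this page does not state the growth rate.

```python
from typing import List


def lease_retry_delays_ms(attempts: int, initial_ms: int = 200,
                          cap_ms: int = 7000, factor: int = 2) -> List[int]:
    """Return the delay before each retry, in milliseconds."""
    delays = []
    delay = initial_ms
    for _ in range(attempts):
        # Each delay grows by `factor` (assumed) until it reaches the cap.
        delays.append(min(delay, cap_ms))
        delay *= factor
    return delays


print(lease_retry_delays_ms(8))
# prints [200, 400, 800, 1600, 3200, 6400, 7000, 7000]
```

Under the doubling assumption, the cap is reached on the seventh retry.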
 
    
	
  
    
    
	
    
    
	10 - Seccomp and Kubernetes
    
	
Seccomp stands for secure computing mode and has been a feature of the Linux
kernel since version 2.6.12. It can be used to sandbox the privileges of a
process, restricting the calls it is able to make from userspace into the
kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a
node to your Pods and containers.
Seccomp fields
  
FEATURE STATE: Kubernetes v1.19 [stable]
    
  
There are four ways to specify a seccomp profile for a
pod:
    
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  securityContext:
    seccompProfile:
      type: Unconfined
  ephemeralContainers:
  - name: ephemeral-container
    image: debian
    securityContext:
      seccompProfile:
        type: RuntimeDefault
  initContainers:
  - name: init-container
    image: debian
    securityContext:
      seccompProfile:
        type: RuntimeDefault
  containers:
  - name: container
    image: docker.io/library/debian:stable
    securityContext:
      seccompProfile:
        type: Localhost
        localhostProfile: my-profile.json
 
The Pod in the example above runs as Unconfined, while the
ephemeral-container and init-container specifically define
RuntimeDefault. If the ephemeral or init container had not set the
securityContext.seccompProfile field explicitly, then the value would have been
inherited from the Pod. The container likewise overrides the Pod-level value and runs with the
Localhost profile my-profile.json.
Generally speaking, fields from (ephemeral) containers have a higher priority
than the Pod level value, while containers which do not set the seccomp field
inherit the profile from the Pod.
Note:
It is not possible to apply a seccomp profile to a Pod or container running with
privileged: true set in the container's securityContext. Privileged
containers always run as Unconfined.
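The precedence rules above can be expressed as a small lookup over the Pod manifest. This is a sketch of the rules as described on this page, not the kubelet's code; the function name `effective_seccomp` is hypothetical, and a return value of None simply means that neither level sets a profile.

```python
from typing import Optional


def effective_seccomp(pod_spec: dict, container: dict) -> Optional[dict]:
    """Resolve the seccomp profile that applies to one container of a Pod."""
    context = container.get("securityContext", {})
    if context.get("privileged"):
        # Privileged containers always run as Unconfined.
        return {"type": "Unconfined"}
    profile = context.get("seccompProfile")
    if profile is not None:
        # A container-level field takes priority over the Pod-level value.
        return profile
    # Otherwise the container inherits the Pod-level profile, if any.
    return pod_spec.get("securityContext", {}).get("seccompProfile")


pod_spec = {"securityContext": {"seccompProfile": {"type": "Unconfined"}}}
init_container = {
    "name": "init-container",
    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
}
plain_container = {"name": "no-override"}

print(effective_seccomp(pod_spec, init_container))
print(effective_seccomp(pod_spec, plain_container))
```

The same function applies to ephemeral, init, and regular containers, because all three use the same securityContext.seccompProfile field.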
The following values are possible for the seccompProfile.type:
- Unconfined
- The workload runs without any seccomp restrictions.
- RuntimeDefault
- A default seccomp profile defined by the
container runtime
is applied. The default profiles aim to provide a strong set of security
defaults while preserving the functionality of the workload. It is possible that
the default profiles differ between container runtimes and their release
versions, for example when comparing those from
CRI-O and
containerd.
- Localhost
- The localhostProfile will be applied, which has to be available on the node
disk (on Linux it's /var/lib/kubelet/seccomp). The availability of the seccomp
profile is verified by the
container runtime
on container creation. If the profile does not exist, then the container
creation will fail with a CreateContainerError.
Localhost profiles
Seccomp profiles are JSON files following the scheme defined by the
OCI runtime specification.
A profile defines actions based on matched syscalls, and can also match on
specific values passed as arguments to syscalls. For example:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "syscalls": [
    {
      "names": [
        "adjtimex",
        "alarm",
        "bind",
        "waitid",
        "waitpid",
        "write",
        "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
The defaultAction in the profile above is defined as SCMP_ACT_ERRNO and
is applied as the fallback for any syscall not matched by the actions defined in syscalls. The error is
defined as code 38 via the defaultErrnoRet field.
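The fallback behavior can be illustrated with a simplified lookup. This sketch only matches syscalls by name and ignores argument filters, architectures, and other fields of the OCI format; the function name `action_for` is hypothetical, and the profile below is a shortened copy of the example above.

```python
from typing import Optional, Tuple

profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "defaultErrnoRet": 38,
    "syscalls": [
        {
            "names": ["adjtimex", "alarm", "bind", "waitid", "waitpid", "write", "writev"],
            "action": "SCMP_ACT_ALLOW",
        }
    ],
}


def action_for(profile: dict, syscall: str) -> Tuple[str, Optional[int]]:
    """Return (action, errno) for a syscall name, falling back to defaultAction."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"], rule.get("errnoRet")
    # No rule matched: the default action and its error code apply.
    return profile["defaultAction"], profile.get("defaultErrnoRet")


print(action_for(profile, "write"))
print(action_for(profile, "mount"))
```

Here write is allowed because it is listed, while mount is not listed and therefore falls back to SCMP_ACT_ERRNO with code 38, which corresponds to ENOSYS on Linux.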
The following actions are generally possible:
- SCMP_ACT_ERRNO
- Return the specified error code.
- SCMP_ACT_ALLOW
- Allow the syscall to be executed.
- SCMP_ACT_KILL_PROCESS
- Kill the process.
- SCMP_ACT_KILL_THREAD and SCMP_ACT_KILL
- Kill only the thread.
- SCMP_ACT_TRAP
- Throw a SIGSYS signal.
- SCMP_ACT_NOTIFY and SECCOMP_RET_USER_NOTIF
- Notify the user space.
- SCMP_ACT_TRACE
- Notify a tracing process with the specified value.
- SCMP_ACT_LOG
- Allow the syscall to be executed after the action has been logged to syslog or
auditd.
Some actions like SCMP_ACT_NOTIFY or SECCOMP_RET_USER_NOTIF may not be
supported depending on the container runtime, OCI runtime or Linux kernel
version being used. There may also be further limitations, for example that
SCMP_ACT_NOTIFY cannot be used as defaultAction or for certain syscalls like
write. All those limitations are defined by either the OCI runtime
(runc,
crun) or
libseccomp.
The syscalls JSON array contains a list of objects referencing syscalls by
their respective names. For example, the action SCMP_ACT_ALLOW can be used
to create a whitelist of allowed syscalls as outlined in the example above. It
would also be possible to define another list using the action SCMP_ACT_ERRNO
but a different return (errnoRet) value.
It is also possible to specify the arguments (args) passed to certain
syscalls. More information about those advanced use cases can be found in the
OCI runtime spec
and the Seccomp Linux kernel documentation.
Further reading
 
    
	
  
    
    
	
    
    
	11 - Linux Node Swap Behaviors
    
To allow Kubernetes workloads to use swap on a Linux node,
you must disable the kubelet's default behavior of failing when swap is detected
(the failSwapOn kubelet configuration field), and specify the memory-swap behavior
as LimitedSwap (the memorySwap.swapBehavior field).
The available choices for swap behavior are:
- NoSwap
- (default) Workloads running as Pods on this node do not and cannot use swap. However, processes
outside of Kubernetes' scope, such as system daemons (including the kubelet itself!) can utilize swap.
This behavior is beneficial for protecting the node from system-level memory spikes,
but it does not safeguard the workloads themselves from such spikes.
- LimitedSwap
- Kubernetes workloads can utilize swap memory. The amount of swap available to a Pod is determined automatically.
To learn more, read swap memory management.