High-Density Container Deployment
Feature Overview
When a large number of Pods are deployed on worker nodes, the following problems may occur:
• When containerd is running, it creates a shim process for each Pod to manage and monitor the Pod. When there are a large number of Pods, the shim processes occupy a large amount of memory.
The shim processes and containers are in 1:1 mapping. Each shim process created by containerd is written in Go and occupies about 20 MB of memory. With 1,000 Pods, about 20 GB of memory is occupied.
• The CPU usage of the native kubelet probe mechanism (exec/http) is high. When there are a large number of Pods, the probe failure rate increases.
The native probe mechanism runs through the CRI. When there are a large number of containers, a large number of concurrent calls are made, which raises the baseline system load and slows down the system response.
• A large number of containers generate a large number of cgroups, which reduces the management performance and reliability of systemd.
Therefore, the preceding problems need to be optimized and mitigated.
Application Scenarios
• The lightweight probe is correctly configured. The service side periodically updates the target file. In this case, the probe detects the file update and returns success.
• The lightweight probe is correctly configured, but the container fails to start and the target file is not periodically updated. In this case, the probe detects that the time since the last file update exceeds the configured value and returns a failure.
Highlights
• The shim processes are eliminated.
The shim function is implemented in containerd, and the kernel eBPF technology is used to monitor process exit, reducing the memory usage of the container base.
• A lightweight kubelet probe is created.
The kubelet probe is implemented using low-cost methods such as file detection instead of message interaction or CRI interaction, reducing the CPU usage of the container base.
Implementation Principles
Design Principles
- The community-native capabilities are retained, and new functions are disabled by default.
- The changes to the native code are minimized.
- The basic information about the request access is recorded at the minimum storage cost.
New Parameters
| Parameter | Default Value | Description |
|---|---|---|
| file | An empty string | Lightweight probe file. |
| path | An empty string | Location of the target file. |
| validity | 0 | Validity period, in seconds. If the target file was last modified within this period, the probe returns success. Otherwise, the probe returns a failure. |
Shimless Implementation Solution
After containerd starts a container process, it creates a pidfd to track the process exit event, and the sched_process_exit eBPF program records the process exit code. Even if a container process exits while containerd is restarting, the kernel eBPF program keeps running. After containerd is started again, it reads the exit code recorded by the eBPF program to identify process exits that occurred during the restart. The procedure is as follows:
• When containerd is started, the eBPF program sched_process_exit is loaded into the kernel.
• containerd uses runc to start the container, records the process ID in the BPF map tracing_tasks, and creates a pidfd to track the process exit.
• After the container process exits, the sched_process_exit program records the process exit code in the BPF map exited_events.
• After detecting the process exit event through the pidfd, containerd reads the process exit code from the BPF map exited_events.
Figure 1 Shimless implementation process

Based on the new container exit awareness mechanism, you can use the eBPF kernel capability to implement shim-related capabilities and move other capabilities to containerd to eliminate the shim processes.
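The pidfd half of this mechanism can be tried out in user space. The following is a minimal sketch, not containerd's implementation: it assumes Linux 5.3 or later and Python 3.9 or later for os.pidfd_open, and it uses a plain dictionary named exited_events as a stand-in for the BPF map of the same name. The function wait_for_exit is a hypothetical helper introduced only for this illustration. Because the sketch watches its own child process, it can obtain the exit code by reaping the child. The eBPF program described above exists to record that code in the kernel instead.

```python
import os
import select
import subprocess
import sys

# Hypothetical stand-in for the exited_events BPF map: pid -> exit code.
exited_events = {}


def wait_for_exit(child: subprocess.Popen, timeout_s: float = 5.0) -> int:
    """Wait for the child to exit, using a pidfd when the platform offers one."""
    try:
        pidfd = os.pidfd_open(child.pid)  # Linux 5.3+, Python 3.9+
    except (AttributeError, OSError):
        pidfd = None  # No pidfd support: fall back to a plain wait below.

    if pidfd is not None:
        try:
            poller = select.poll()
            poller.register(pidfd, select.POLLIN)
            # A pidfd becomes readable once the process has terminated.
            if not poller.poll(int(timeout_s * 1000)):
                raise TimeoutError(f"pid {child.pid} is still running")
        finally:
            os.close(pidfd)

    # In the shimless design the exit code would be read from the BPF map.
    # Here the parent reaps its own child and stores the result instead.
    exited_events[child.pid] = child.wait(timeout=timeout_s)
    return exited_events[child.pid]


if __name__ == "__main__":
    proc = subprocess.Popen([sys.executable, "-c", "raise SystemExit(3)"])
    print(wait_for_exit(proc))
```

The fallback branch only keeps the sketch runnable on systems without pidfd support. It is not part of the design described in this section.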
Working Principle of the Lightweight Probe
- The container process periodically updates the timestamp of the file.path file.
- The kubelet lightweight probe periodically checks the modification timestamp of the file.path file based on the user-configured detection interval. If the time elapsed since the file was last modified does not exceed the value of file.validity, the detection is successful. Otherwise, the detection fails. Other probe parameters are the same as those of the native probes.
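The check performed by the lightweight probe amounts to comparing the age of a file with a threshold. The following is a minimal sketch of that comparison, not the kubelet code. The function name file_probe and the optional now parameter are introduced here for illustration, validity is assumed to be expressed in seconds, and treating a missing file as a failure is also an assumption.

```python
import os
import time
from typing import Optional


def file_probe(path: str, validity_s: float, now: Optional[float] = None) -> bool:
    """Return True if `path` was modified within the last `validity_s` seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # Assumption: a missing or unreadable file counts as a failed probe.
        return False
    return (now - mtime) <= validity_s
```

A single stat call per detection interval is what keeps this approach cheap compared with an exec or HTTP probe.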
Relationship with Related Features
None
Creating a Shimless Pod That Contains Lightweight Probes
Prerequisites
- The kernel version is v5.4 or later, and the kernel compilation option CONFIG_DEBUG_INFO_BTF is set to y. If cgroup v2 is used, the recommended kernel version is v5.14 or later.

      grep BTF /boot/config-$(uname -r)

  Supported: CONFIG_DEBUG_INFO_BTF=y
  Not supported: # CONFIG_DEBUG_INFO_BTF is not set
- The probe file must be correctly created for a container and periodically updated.
- The shimless configuration cannot be modified while the system is running. Otherwise, both shim and shimless processes exist.
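The kernel prerequisites can also be checked programmatically, for example in a node preflight script. The following is a minimal sketch. The helper names kernel_at_least and btf_enabled are hypothetical, and the functions only parse strings that the caller supplies (the output of uname -r and the content of the kernel config file).

```python
import re
from typing import Tuple


def kernel_at_least(release: str, minimum: Tuple[int, int]) -> bool:
    """Compare the leading major.minor of a `uname -r` string with `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def btf_enabled(config_text: str) -> bool:
    """Return True if the kernel config contains CONFIG_DEBUG_INFO_BTF=y."""
    return any(
        line.strip() == "CONFIG_DEBUG_INFO_BTF=y"
        for line in config_text.splitlines()
    )
```

On a node, platform.release() provides the release string and the file /boot/config-<release> provides the config text. Pass (5, 4) as the minimum, or (5, 14) when cgroup v2 is used.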
Context
You need to understand how to enable the InPlacePodVerticalScaling feature of the kube-apiserver and kubelet components.
Constraints
None
Procedure
Eliminating the Shim Processes
- Ensure that containerd (to be replaced) and containerd-shim-shimless-v2 (new) are in the correct path, for example, /usr/local/bin/.
- Run the vi /etc/containerd/config.toml command to edit the configuration file. Add the following configuration items to the [plugins."io.containerd.grpc.v1.cri".containerd.runtimes] section:

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.unified]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        privileged_without_host_devices_all_devices_allowed = false
        runtime_engine = ""
        runtime_path = "/usr/local/bin/containerd-shim-shimless-v2"
        runtime_root = ""
        runtime_type = "io.containerd.runc.v2"
        sandbox_mode = "podsandbox"
        snapshotter = ""
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.unified.options]
          BinaryName = ""
          CriuImagePath = ""
          CriuPath = ""
          CriuWorkPath = ""
          IoGid = 0
          IoUid = 0
          NoNewKeyring = false
          NoPivotRoot = false
          Root = ""
          ShimCgroup = ""
          SystemdCgroup = true

- Modify the containerd.service file and add the LimitMEMLOCK=infinity configuration item to the [Service] section.
- Run the kubectl apply -f xxx.yaml command to create a RuntimeClass. The following is an example of xxx.yaml:

      kind: RuntimeClass
      apiVersion: node.k8s.io/v1
      metadata:
        name: unified
      handler: unified

- Start the Pod and add runtimeClassName: unified to the Pod spec to reference the RuntimeClass. The following is an example:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: shimless-pod
      spec:
        replicas: 10
        selector:
          matchLabels:
            app: shimless-pod
        template:
          metadata:
            labels:
              app: shimless-pod
          spec:
            restartPolicy: Always
            runtimeClassName: unified # Reference the RuntimeClass in the Pod spec.
            containers:
              - name: shimless-container
                image: registry.k8s.io/e2e-test-images/nginx:1.15-2
                resources:
                  limits:
                    cpu: 300m
                ports:
                  - containerPort: 80
                readinessProbe:
                  httpGet:
                    path: /
                    port: 80
                  initialDelaySeconds: 1
                  periodSeconds: 1
                  timeoutSeconds: 1

- Verification: After the 10 shimless Pods are created, run the ps -ef|grep shim command to check whether the number of shim processes increases. The result shows that the number of shim processes does not increase.
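To make the verification repeatable, the shim count can be taken before and after the Deployment is created and the two values compared. The following is a minimal sketch. The helper names are hypothetical, and the default match string containerd-shim is an assumption about how native shim processes appear in the ps output. Adjust it to the process names observed on the node.

```python
import subprocess


def count_shim_processes(ps_output: str, needle: str = "containerd-shim") -> int:
    """Count `ps -ef` rows whose command column contains `needle`."""
    count = 0
    for line in ps_output.splitlines():
        # ps -ef prints UID PID PPID C STIME TTY TIME CMD; the 8th field is CMD.
        fields = line.split(None, 7)
        if len(fields) == 8 and needle in fields[7]:
            count += 1
    return count


def current_ps_output() -> str:
    """Return the output of `ps -ef` on the local node."""
    result = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Call count_shim_processes(current_ps_output()) once before and once after creating the shimless Pods. Matching on the command column keeps the grep process itself out of the count, which the plain ps -ef|grep shim pipeline does not.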
Lightweight Probe
- Mount an empty volume (emptyDir) to the service Pod and create a target file in the empty volume.
- The container needs to periodically update the target file to indicate that the container is alive.
- Configure a lightweight probe. Ensure that the path is the same as that in the empty volume. The following is an example:

      apiVersion: v1
      kind: Pod
      metadata:
        name: file-probe-success
      spec:
        nodeSelector:
          my-node: master
        containers:
          - name: container1
            image: registry.k8s.io/e2e-test-images/busybox:1.29-2
            command: ["/bin/sh", "-c"]
            args:
              - |
                mkdir -p /probe-dir;
                touch /probe-dir/probe;
                # Simulate the periodic update of the target file by the service.
                while true; do
                  touch /probe-dir/probe;
                  sleep 20;
                done
            readinessProbe:
              file: # Lightweight probe.
                path: /probe-volume/probe # Path of the target file in the empty volume
                validity: 30 # Tolerable update period. If the target file was last modified less than 30 seconds before the detection, the probe returns success. Otherwise, the probe returns a failure.
              initialDelaySeconds: 5
              periodSeconds: 10
              timeoutSeconds: 1
              failureThreshold: 3
            livenessProbe:
              file: # File probe
                path: /probe-volume/probe # Destination path
                validity: 10 # Tolerable update time difference
              initialDelaySeconds: 5
              periodSeconds: 10
              timeoutSeconds: 1
              failureThreshold: 3
            volumeMounts:
              - name: probe-volume
                mountPath: /probe-dir
        volumes:
          - name: probe-volume # Empty volume
            emptyDir: {}

- Run the kubectl apply -f deploy.yaml command to create the Pod. The probe is then ready.
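In a real service, the shell loop in the example is replaced by a heartbeat inside the application. The following is a minimal sketch of such a heartbeat. The function name heartbeat and its parameters are hypothetical, and the path and interval must match the mounted volume and the configured validity.

```python
import time
from pathlib import Path


def heartbeat(path: str, interval_s: float, beats: int) -> None:
    """Touch `path` `beats` times, sleeping `interval_s` between updates."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    for i in range(beats):
        target.touch()  # Creates the file if needed and updates its mtime.
        if i < beats - 1:
            time.sleep(interval_s)
```

A service would typically run this from a background thread and stop calling it when it considers itself unhealthy. The interval should be shorter than the configured validity so that a healthy container is never reported as failed.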
Related Operations
None