Version: v25.09

High-Density Container Deployment

Feature Overview

When a large number of Pods are deployed on worker nodes, the following problems may occur:

• When containerd is running, it creates a shim process for each Pod for Pod management and monitoring. When there are a large number of Pods, the shim processes occupy a large amount of memory.

The shim processes and containers are in 1:1 mapping. The shim processes created by containerd are written in Go, and each occupies about 20 MB of memory. With 1,000 Pods, the shim processes occupy about 20 GB of memory.

• The CPU usage of the native kubelet probe mechanism (exec/http) is high. When there are a large number of Pods, the detection failure rate increases.

The native probe mechanism runs through the CRI interface. When there are a large number of containers, the probes generate a large number of concurrent CRI calls, which raises the baseline system load and slows down system response.

• A large number of containers generate a large number of cgroups. As a result, the management performance and reliability of systemd are reduced.

The high-density container deployment feature is designed to mitigate these problems.
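The shim memory estimate in the first item above is a simple multiplication. The following minimal sketch reproduces it; the function name shim_memory_gb is hypothetical, and the 20 MB default is the approximate figure quoted above, not a measured value.

```python
def shim_memory_gb(pod_count: int, mb_per_shim: float = 20.0) -> float:
    """Estimate total shim memory in GB (decimal units: 1 GB = 1000 MB)."""
    return pod_count * mb_per_shim / 1000


print(shim_memory_gb(1000))  # 20.0
```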

Application Scenarios

• The lightweight probe is correctly configured. The service side periodically updates the target file. In this case, the probe detects the file update and returns success.

• The lightweight probe is correctly configured. The container fails to be started and the target file cannot be periodically updated. In this case, the probe detects that the file update duration exceeds the configured value. As a result, the probe returns a failure.

Highlights

• The shim processes are eliminated.

The shim function is implemented in containerd, and the kernel eBPF technology is used to monitor process exit, reducing the memory usage of the container base.

• A lightweight kubelet probe is created.

The kubelet probe is implemented using low-cost methods such as file detection instead of message interaction or CRI interaction, reducing the CPU usage of the container base.

Implementation Principles

Design Principles

  1. The community-native capabilities are retained, and new functions are disabled by default.
  2. The changes to the native code are minimized.
  3. The basic information about the request access is recorded at the minimum storage cost.

New Parameters

Parameter   Default Value     Description
file        An empty string   Lightweight probe file.
path        An empty string   Location of the target file.
validity    0                 If the modification time is within the value range, the probe returns success. Otherwise, the probe returns a failure.

Shimless Implementation Solution

After containerd starts a container process, it creates a pidfd to trace the process exit event, and the sched_process_exit eBPF program records the process exit code. The eBPF program runs in the kernel, so it keeps recording even if a container process exits while containerd is restarting. After containerd is started again, it reads the exit codes recorded by the eBPF program to identify the processes that exited during the restart. The procedure is as follows:

• When containerd is started, the eBPF program sched_process_exit is loaded in the kernel.

• containerd uses runc to start the container, records the process ID to the BPF map tracing_tasks, and creates a pidfd to trace the process exit.

• After the container process exits, the sched_process_exit program records the process exit code to the BPF map exited_events.

• After detecting the process exit event through the pidfd, containerd reads the process exit code from the BPF map exited_events.

Figure 1 Shimless implementation process
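The pidfd part of this procedure can be tried out with the Python standard library. The following is only a sketch under stated assumptions, not the containerd implementation: the function name wait_for_exit is hypothetical, the eBPF program and the exited_events map are not reproduced, and the exit code is obtained by reaping the child, which works only because the sketch is the parent of the process. If pidfd is unavailable on the host, the sketch falls back to a plain wait.

```python
import os
import select
import subprocess
import sys


def wait_for_exit(proc: subprocess.Popen, timeout_s: float = 5.0) -> int:
    """Block until proc exits, using a pidfd as the exit notification."""
    try:
        pidfd = os.pidfd_open(proc.pid)  # requires Linux 5.3+ and Python 3.9+
    except (AttributeError, OSError):
        # pidfd is unavailable here; fall back to a plain wait.
        return proc.wait(timeout=timeout_s)
    try:
        poller = select.poll()
        poller.register(pidfd, select.POLLIN)
        # A pidfd becomes readable once the process has terminated.
        if not poller.poll(int(timeout_s * 1000)):
            raise TimeoutError(f"process {proc.pid} is still running")
    finally:
        os.close(pidfd)
    # Stand-in for reading exited_events: the parent reaps the child directly.
    return proc.wait()


child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(7)"])
print(wait_for_exit(child))  # 7
```

A process that is not the parent of the container process cannot reap it this way, which is why the procedure above records the exit code in a BPF map instead.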

Implementation solution

Based on the new container exit awareness mechanism, you can use the eBPF kernel capability to implement shim-related capabilities and move other capabilities to containerd to eliminate the shim processes.

Working Principle of the Lightweight Probe

  1. The container process updates the timestamp of the file.path file periodically.

  2. The kubelet lightweight probe periodically checks the modification timestamp of the file.path file at the user-configured detection interval. If the time elapsed since the file was last modified does not exceed the value of file.validity, the detection is successful. Otherwise, the detection fails. The other probe parameters are the same as those of the native probes.
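The check in step 2 can be sketched as a comparison between the age of the file and the validity value. This is an illustration, not the kubelet code: the function name file_probe and the now parameter are hypothetical, validity is taken to be in seconds, and treating a missing file as a failure is an assumption.

```python
import os
import time
from typing import Optional


def file_probe(path: str, validity: float, now: Optional[float] = None) -> bool:
    """Return True if the file at path was modified within validity seconds."""
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # Assumption: a missing or unreadable target file counts as a failure.
        return False
    current = time.time() if now is None else now
    return current - mtime <= validity
```

The check is a single stat call on the node, with no CRI call and no process started in the container.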


Creating a Shimless Pod That Contains Lightweight Probes

Prerequisites

  • The kernel version is v5.4 or later, and the kernel compilation option CONFIG_DEBUG_INFO_BTF is set to y. If cgroup v2 is used, the recommended kernel version is v5.14 or later.

    grep BTF /boot/config-$(uname -r)

    Supported: CONFIG_DEBUG_INFO_BTF=y

    Not supported: # CONFIG_DEBUG_INFO_BTF is not set
  • The probe file must be correctly created for a container and periodically updated.

  • The shimless configuration cannot be modified while the system is running. Otherwise, containers managed by shim processes and shimless containers coexist on the node.
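The kernel prerequisites can also be checked from a script. The sketch below parses the same text that the grep command above prints; the function names btf_enabled and kernel_at_least are hypothetical, and the release string is assumed to start with major.minor, as in the output of uname -r.

```python
import re


def btf_enabled(config_text: str) -> bool:
    """Return True if the kernel config text sets CONFIG_DEBUG_INFO_BTF=y."""
    return any(
        line.strip() == "CONFIG_DEBUG_INFO_BTF=y"
        for line in config_text.splitlines()
    )


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare a kernel release string such as '5.14.0-70' with major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


print(btf_enabled("CONFIG_DEBUG_INFO_BTF=y"))  # True
print(kernel_at_least("4.19.90", 5, 4))        # False
```

On a node, platform.release() returns the release string, and the config text can be read from /boot/config-<release> if the distribution installs that file.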

Context

You need to understand how to enable the InPlacePodVerticalScaling feature of the kube-apiserver and kubelet components.

Constraints

None

Procedure

Eliminating the Shim Processes

  1. Ensure that the containerd binary (which replaces the existing one) and the new containerd-shim-shimless-v2 binary are in the correct path, for example, /usr/local/bin/.

  2. Run the vi /etc/containerd/config.toml command to edit the containerd configuration file.

    Add the following configuration items to the [plugins."io.containerd.grpc.v1.cri".containerd.runtimes] section:

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.unified]
    base_runtime_spec = ""
    cni_conf_dir = ""
    cni_max_conf_num = 0
    container_annotations = []
    pod_annotations = []
    privileged_without_host_devices = false
    privileged_without_host_devices_all_devices_allowed = false
    runtime_engine = ""
    runtime_path = "/usr/local/bin/containerd-shim-shimless-v2"
    runtime_root = ""
    runtime_type = "io.containerd.runc.v2"
    sandbox_mode = "podsandbox"
    snapshotter = ""

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.unified.options]
    BinaryName = ""
    CriuImagePath = ""
    CriuPath = ""
    CriuWorkPath = ""
    IoGid = 0
    IoUid = 0
    NoNewKeyring = false
    NoPivotRoot = false
    Root = ""
    ShimCgroup = ""
    SystemdCgroup = true
  3. Modify the containerd.service file and add the LimitMEMLOCK=infinity configuration item to the [Service] section.

  4. Run the kubectl apply -f xxx.yaml command to create RuntimeClass. The following is an example of xxx.yaml.

    kind: RuntimeClass
    apiVersion: node.k8s.io/v1
    metadata:
      name: unified
    handler: unified
  5. Add runtimeClassName: unified to the Pod spec to reference the RuntimeClass, and then start the Pod. The following is an example:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: shimless-pod
    spec:
      replicas: 10
      selector:
        matchLabels:
          app: shimless-pod
      template:
        metadata:
          labels:
            app: shimless-pod
        spec:
          restartPolicy: Always
          runtimeClassName: unified # Import RuntimeClass to the Pod spec.
          containers:
          - name: shimless-container
            image: registry.k8s.io/e2e-test-images/nginx:1.15-2
            resources:
              limits:
                cpu: 300m
            ports:
            - containerPort: 80
            readinessProbe:
              httpGet:
                path: /
                port: 80
              initialDelaySeconds: 1
              periodSeconds: 1
              timeoutSeconds: 1
  6. Verification: Run the ps -ef|grep shim command before and after the 10 shimless Pods are created, and compare the number of shim processes. If the shimless configuration takes effect, the number of shim processes does not increase.
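The comparison in step 6 can be scripted. The sketch below counts shim processes in the text printed by ps -ef; the function names are hypothetical, and matching executables whose name starts with containerd-shim is an assumption about how shim processes appear on the node. Unlike a plain grep, it does not count the grep process itself.

```python
import subprocess


def count_shim_processes(ps_output: str, prefix: str = "containerd-shim") -> int:
    """Count processes in ps -ef output whose executable name starts with prefix."""
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) < 8:
            continue
        executable = fields[7].split()[0].rsplit("/", 1)[-1]
        if executable.startswith(prefix):
            count += 1
    return count


def current_shim_count() -> int:
    """Run ps -ef on this node and count the shim processes."""
    result = subprocess.run(["ps", "-ef"], capture_output=True, text=True, check=True)
    return count_shim_processes(result.stdout)
```

Call current_shim_count() once before applying the Deployment and once after all 10 Pods are running. The two values are expected to be equal.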

Lightweight Probe

  1. Mount an empty volume to the service Pod and create a target file in the empty volume.

  2. The container needs to periodically update the target file, indicating that the container is alive.

  3. Configure a lightweight probe. Ensure that the path is the same as that in the empty volume. The following is an example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: file-probe-success
    spec:
      nodeSelector:
        my-node: master
      containers:
      - name: container1
        image: registry.k8s.io/e2e-test-images/busybox:1.29-2
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Simulate the periodic update of the target file by the service.
          mkdir -p /probe-dir;
          touch /probe-dir/probe;
          while true; do
            touch /probe-dir/probe;
            sleep 20;
          done
        readinessProbe:
          file: # Lightweight probe.
            path: /probe-volume/probe # Path of the target file in the empty volume.
            # Tolerable update period. If the file was last modified less than
            # 30 seconds before the detection time, the probe returns success.
            # Otherwise, the probe returns a failure.
            validity: 30
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 1
          failureThreshold: 3
        livenessProbe:
          file: # File probe.
            path: /probe-volume/probe # Path of the target file.
            validity: 10 # Tolerable time since the last update.
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 1
          failureThreshold: 3
        volumeMounts:
        - name: probe-volume
          mountPath: /probe-dir
      volumes:
      - name: probe-volume # Empty volume.
        emptyDir: {}
  4. Run the kubectl apply -f deploy.yaml command, where deploy.yaml contains the preceding configuration, to create the Pod. The probes take effect after the Pod is created.
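Step 2 leaves the periodic update to the service itself. For a service written in Python, one possible approach is a background thread that touches the target file. The sketch below makes that assumption; the function name run_heartbeat and the interval value are hypothetical, and /probe-dir/probe is the path used in the preceding example.

```python
import threading
from pathlib import Path


def run_heartbeat(path: str, interval: float, stop: threading.Event) -> None:
    """Update the modification time of path every interval seconds until stopped."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    while True:
        target.touch()  # creates the file or refreshes its modification time
        if stop.wait(interval):  # returns True as soon as stop is set
            break


def start_heartbeat(path: str = "/probe-dir/probe", interval: float = 5.0):
    """Start the heartbeat in a daemon thread and return (thread, stop event)."""
    stop = threading.Event()
    thread = threading.Thread(
        target=run_heartbeat, args=(path, interval, stop), daemon=True
    )
    thread.start()
    return thread, stop
```

Keeping the update interval shorter than the configured validity avoids windows in which the file looks stale between two updates.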
