Version: v25.09

NUMA-aware Scheduling Development Guide

Introduction

The non-uniform memory access (NUMA) architecture has become increasingly prevalent in modern high-performance computing and large-scale distributed systems. In this architecture, system memory is divided among multiple NUMA nodes, each with its own local memory and CPUs. Because local memory is faster to access than memory on a remote node, keeping tasks close to their data reduces memory access latency and improves system performance. However, the complexity of the NUMA architecture also introduces challenges in system resource management, especially in multi-task and multi-threaded environments. To fully leverage the benefits of the NUMA architecture, refined management and monitoring of system resources are essential.

Visualized NUMA resource monitoring provides real-time insight into the allocation and utilization of NUMA resources through an intuitive graphical interface, helping users better understand and manage these resources and improve system performance and resource utilization.

In containerized clusters, resource scheduling is typically handled by various schedulers. NUMA-aware scheduling applies tailored optimization methods to tasks of high, medium, and low priority, and ensures optimal pod placement through cluster-level and intra-node NUMA scheduling, boosting system performance. For details, see NUMA-aware Scheduling.

Restrictions

  • To enable this feature, modify the kubelet configuration file by performing the following steps. These steps will be automatically performed when the NUMA affinity policy or the optimal NUMA distance policy is enabled.

    1. Open the configuration file of kubelet.

      vi /var/lib/kubelet/config.yaml
    2. Add or modify the configuration items.

      cpuManagerPolicy: static
      topologyManagerPolicy: xxx   # Replace xxx with the required topology policy, for example, single-numa-node, restricted, or best-effort.
    3. Apply the modifications.

      rm -rf /var/lib/kubelet/cpu_manager_state
      systemctl daemon-reload
      systemctl restart kubelet
    4. Check the status of kubelet.

      systemctl status kubelet

      The Running state indicates success. (An optional verification sketch is provided at the end of this section.)

      NOTE
      Modifying the topology policy of a node will restart kubelet, which may cause some pods to be rescheduled. Proceed with caution in production environments.

  • To enable the NUMA-Fast feature, you need to modify the system configuration in advance. The procedure is as follows:

    1. Modify the kernel command line parameters of the operating system (OS).

      vim /etc/grub2-efi.cfg

      Locate the linux line that corresponds to the current OS image and append the following kernel parameters.

      mem_sampling_on numa_icon=enable

      The following is an example:

      linux   /vmlinuz-5.10.0-216.0.0.115.oe2203sp4.aarch64 root=/dev/mapper/openeuler-root ro rd.lvm.lv=openeuler/root rd.lvm.lv=openeuler/swap video=VGA-1:640x480-32@60me cgroup_disable=files apparmor=0 crashkernel=1024M,high smmu.bypassdev=0x1000:0x17 smmu.bypassdev=0x1000:0x15 arm64.nopauth console=tty0 kpti=off mem_sampling_on numa_icon=enable
    2. Modify the configuration file of containerd.

      vi /etc/containerd/config.toml

      Modify the content as follows:

      [plugins."io.containerd.nri.v1.nri"]
      disable = false

      Restart containerd to apply the update.

      systemctl restart containerd

NOTE
Currently, the NUMA-Fast feature is supported only in environments running openEuler 22.03 LTS SP4 in the Arm architecture. This feature has a lower priority than CPU pinning. It will be automatically disabled if there are pods with CPU pinning in the cluster.

  • To enable the cluster-level NUMA monitoring feature, you need to configure Prometheus in advance.

    1. For details about how to connect Prometheus Operator to NUMA Exporter, see Task Scenario 2.

    2. For details about how to connect Prometheus to NUMA Exporter, see Task Scenario 3.
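
The following sketch consolidates optional checks for the restrictions above. It assumes the static CPU manager policy has been configured as described in the first item and that the node has been rebooted after the GRUB change so that the new kernel command line takes effect; paths and parameter names may differ in your environment.

# Confirm that kubelet regenerated its CPU manager state with the static policy.
cat /var/lib/kubelet/cpu_manager_state   # expected to contain "policyName":"static"

# Confirm that the NUMA-Fast kernel parameters are active on the running kernel.
grep -o 'mem_sampling_on\|numa_icon=enable' /proc/cmdline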

Environment Preparation

Environment Requirements

  • Kubernetes 1.21 or later has been deployed.
  • Prometheus has been deployed.
  • containerd 1.7 or later has been deployed.
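
If you are unsure which versions are installed, you can check them on a node, for example:

kubectl version        # Kubernetes client and server versions
containerd --version   # containerd version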

Environment Deployment

For details, see NUMA-aware Scheduling User Guide.

Environment Verification

Take Volcano as an example. If all of the following pods are in the Running state, the environment is successfully set up.

volcano-system   volcano-admission-xx-xx         1/1   Running   0   xmxxs
volcano-system   volcano-admission-init-xx       1/1   Running   0   xmxxs
volcano-system   volcano-controllers-xx-xx       1/1   Running   0   xmxxs
volcano-system   volcano-schedulers-xx-xx        1/1   Running   0   xmxxs
volcano-system   volcano-exporter-daemonset-xx   n/n   Running   0   xmxxs
volcano-system   volcano-config-website-xx-xx    1/1   Running   0   xmxxs
volcano-system   volcano-config-xx-xx            1/1   Running   0   xmxxs
volcano-system   numa-exporter-xx                n/n   Running   0   xmxxs
default          numaadj-xx                      n/n   Running   0   xmxxs
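
The listing above can be produced with a command such as the following (the numaadj pod runs in the default namespace, so listing across all namespaces is the simplest check):

kubectl get pods -A | grep -E 'volcano-system|numaadj'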

Task Scenario 1: Modifying a Scheduling Policy in Volcano

Scenario Overview

Configure the scheduling policy in Volcano.

System Architecture

The management and control layer leverages the openFuyao platform's console-website capabilities to provide a frontend interface for NUMA topology visualization, status monitoring, and scheduling policy configuration. In a cluster, both Kubernetes schedulers and user-defined schedulers are supported; a scheduler works with kubelet to assign pods to node resources. At runtime, the NUMA-Fast capability is used to optimize pod affinity scheduling: pods with strong network affinity are placed on the same NUMA node, improving overall performance and delivering an end-to-end optimized scheduling solution.

Figure 1 NUMA-aware scheduling architecture


API Description (Taking numa-aware as an Example)

Table 1 Main APIs

API                                   Description
GET /rest/scheduling/v1/numaaware     Query the NUMA-aware scheduling policy.
PUT /rest/scheduling/v1/numaaware     Modify the NUMA-aware scheduling policy.
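
Both APIs can be exercised with curl once the service is reachable (see Debugging and Verification below for exposing it through a NodePort). The GET call mirrors the one used later in this scenario; the PUT payload shown here is hypothetical, because this guide does not define the exact request format, only the plugin and status values handled by methodPut.

curl -X GET "http://<node-ip>:<NodePort>/rest/scheduling/v1/numaaware"
# Hypothetical request body; adjust it to the service's actual contract.
curl -X PUT "http://<node-ip>:<NodePort>/rest/scheduling/v1/numaaware" \
  -H "Content-Type: application/json" \
  -d '{"plugin": "numa-aware", "status": "open"}'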

Development Procedure

Check the configuration file to verify whether the numa-aware plug-in is available. In addition, you can modify the configuration file to enable or disable the NUMA-aware extended scheduling policy. The core code is as follows:

func methodPut(clientset kubernetes.Interface, plugin string, status string) (*httputil.ResponseJson, int) {
    // Obtain the configuration file of Volcano.
    configmap, err := k8sutil.GetConfigMap(clientset, constant.VolcanoConfigServiceConfigmap,
        constant.VolcanoConfigServiceDefaultNamespace)
    if err != nil {
        zlog.Errorf("GetConfigMap failed, %v", err)
        return httputil.GetDefaultServerFailureResponseJson(), http.StatusInternalServerError
    }

    // Obtain the specific content of the configuration file.
    configContent, exists := configmap.Data[constant.VolcanoConfigServiceConfigmapName]
    zlog.Infof("configContent: %v", configContent)
    if !exists {
        zlog.Errorf("GetConfigMap failed, content is empty")
        return httputil.GetDefaultServerFailureResponseJson(), http.StatusInternalServerError
    }

    // Modify the policy status.
    str := "name: " + plugin
    if status == "open" {
        if !strings.Contains(configContent, str) {
            configContent = insertPlugin(configContent)
        }
    } else if status == "close" {
        if strings.Contains(configContent, str) {
            configContent = removePlugin(configContent, plugin)
        }
    } else {
        zlog.Errorf("status is not open or close: %v", status)
        return httputil.GetDefaultServerFailureResponseJson(), http.StatusInternalServerError
    }

    // Update the configuration file.
    configmap.Data[constant.VolcanoConfigServiceConfigmapName] = configContent
    configmap, err = k8sutil.UpdateConfigMap(clientset, configmap)
    if err != nil {
        zlog.Errorf("UpdateConfigMap failed, %v", err)
        return httputil.GetDefaultServerFailureResponseJson(), http.StatusInternalServerError
    }

    // Restart the scheduler by deleting its pod so that the updated ConfigMap is loaded.
    err = k8sutil.DeleteVolcanoPod(clientset, constant.VolcanoConfigServiceDefaultNamespace)
    if err != nil {
        zlog.Errorf("delete pod volcano-scheduler failed, %v", err)
        return httputil.GetDefaultServerFailureResponseJson(), http.StatusInternalServerError
    }

    return &httputil.ResponseJson{
        Code: constant.Success,
        Msg:  "success",
    }, http.StatusOK
}

Debugging and Verification

Call the API corresponding to this task scenario to check whether the scheduling policy can be viewed or modified.

  1. Call the API. Change the service type from ClusterIP to NodePort and use the node IP address together with the NodePort value to access the API.

    curl -X GET  "http://192.168.100.59:NodePort/rest/scheduling/v1/numaaware"
  2. View the Volcano configuration file in the cluster.

    kubectl edit configmaps volcano-scheduler-configmap -n volcano-system

    Check whether the file contains the - numa-aware entry. If so, the NUMA-aware scheduling policy is enabled.

  3. Check whether the scheduling policy takes effect.

    Create a Deployment with the affinity policy set to single-numa-node, use Volcano as the scheduler, and check whether the scheduling is performed as expected (see the verification sketch after the manifest).

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
          annotations:
            # Specify the NUMA affinity.
            volcano.sh/numa-topology-policy: single-numa-node
        spec:
          # Specify Volcano as the scheduler.
          schedulerName: volcano
          containers:
          - name: nginx
            image: nginx
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                cpu: 1
                memory: 100Mi
              requests:
                cpu: 1
                memory: 100Mi
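
    To check the result, you can, for example, look at where the pod landed and which CPUs it was given. This is a sketch only; the cgroup path differs between cgroup v1 and cgroup v2.

    kubectl get pod -l app=nginx -o wide   # note the node the pod was placed on
    # Inspect the CPU set assigned to the container (cgroup v2 first, then v1).
    kubectl exec deploy/nginx -- cat /sys/fs/cgroup/cpuset.cpus.effective 2>/dev/null \
      || kubectl exec deploy/nginx -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
    # With the static CPU manager and the single-numa-node policy, the listed CPUs
    # should all belong to one NUMA node (compare with the lscpu output on that node).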

Task Scenario 2: Connecting Prometheus Operator to NUMA Exporter

Scenario Overview

Prometheus Operator uses custom resource definitions (CRDs) to manage Prometheus configurations, including Prometheus, AlertManager, ServiceMonitor, and PodMonitor. ServiceMonitor defines how Prometheus discovers and scrapes monitoring data of specific services. It specifies one or more Kubernetes services as scrape targets for Prometheus.

Development Procedure

  1. Edit the YAML file of ServiceMonitor.

    Create a YAML file named numaExporter-serviceMonitor.yaml. The file content is as follows:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        name: numa-exporter
      name: numa-exporter
      namespace: volcano-system
    spec:
      endpoints:
      - interval: 30s
        port: https
      selector:
        matchLabels:
          app: numa-exporter
      namespaceSelector:
        matchNames:
        - volcano-system
  2. Apply ServiceMonitor resources so that Prometheus can discover NUMA Exporter and scrape monitoring data.

    Run the following kubectl command to apply ServiceMonitor resources to the Kubernetes cluster.

    kubectl apply -f numaExporter-serviceMonitor.yaml
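
    You can confirm that the resource was created, for example, by listing the ServiceMonitor objects in the namespace:

    kubectl get servicemonitors -n volcano-system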

Debugging and Verification

  1. Log in to Prometheus.

    To access the Prometheus page, enter http://prometheus-server:9090 in the address box of a browser.

    NOTE
    If Prometheus is configured with authentication or other security settings, enter the required credentials.

  2. Verify the monitored objects.

    2.1 Check Prometheus targets.

    On the Prometheus web page, navigate to the Targets page (usually found under the Status menu) and check whether the numa-exporter monitored objects have been correctly discovered and scraped.

    2.2 View the data.

    Enter numa_node_cpus_count in the search box to obtain the metric data.
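
    As an alternative to the web UI, the same check can be scripted against the Prometheus HTTP API, for example:

    # A non-empty result list means the metric is being scraped.
    curl 'http://prometheus-server:9090/api/v1/query?query=numa_node_cpus_count'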

Task Scenario 3: Connecting Prometheus to NUMA Exporter

Scenario Overview

In a Kubernetes cluster, a ConfigMap is used to manage and update the Prometheus configuration file (prometheus.yml).

Development Procedure

  1. Edit the ConfigMap.

    Run the following command to find the ConfigMap that stores prometheus.yml:

    kubectl get configmap -n monitoring

    Use the editor to open and modify the ConfigMap.

    kubectl edit configmap <prometheus-config-name> -n monitoring

    In the editor, add or modify the scrape_configs section.

    scrape_configs:
      - job_name: 'numa-exporter'
        static_configs:
          - targets: ['numa-exporter.volcano-system:9201']
  2. Reload the Prometheus configuration.

    Save and exit the editor, and hot-reload Prometheus to apply the changes. Note that the /-/reload endpoint is available only when Prometheus is started with the --web.enable-lifecycle flag; otherwise, restart the Prometheus pod instead.

    curl -X POST http://prometheus-server:9090/-/reload # Replace the fields in the URL with the actual hostname or IP address and port number where Prometheus is running.
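
    If the reload succeeds, the new scrape job appears in the running configuration, which can be checked, for example, through the Prometheus HTTP API:

    curl -s 'http://prometheus-server:9090/api/v1/status/config' | grep numa-exporter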

Debugging and Verification

  1. Log in to Prometheus.

    To access the Prometheus page, enter http://prometheus-server:9090 in the address box of a browser.

    NOTE
    If Prometheus is configured with authentication or other security settings, enter the required credentials.

  2. Verify the monitored objects.

    • Check Prometheus targets: On the Prometheus web page, navigate to the Targets page (usually found under the Status menu) and check whether the numa-exporter monitored objects have been correctly discovered and scraped.

    • View the data: Enter numa_node_cpus_count in the search box to obtain the metric data.

Task Scenario 4: Scheduling Non-Exclusive Pods to the Same NUMA Node Based on Affinity

Scenario Overview

Pods with significant network communication with each other can be treated as a group of related pods. Scheduling this group of related pods to the same NUMA node can effectively reduce cross-NUMA access, thereby improving system throughput.

System Architecture

  • Obtaining pod affinity: The NUMA-Fast module of the OS collects statistics on thread affinity and converts them into pod affinity.
  • Optimizing runtime affinity: The node resource interface (NRI) is used to track pod resource allocation and store inter-pod affinity. When a pod is created, its cpuset/memset is modified so that pods with affinity are allocated to the same NUMA node. The following figure shows the principles behind pod affinity scheduling.

Figure 2 Principles of pod affinity scheduling


Development Procedure

Run the following command to view the affinity relationship between pods:

kubectl get oenuma podafi -n tcs -o yaml

In the spec.node.numa field, you can view the affinity relationship between pods.

Debugging and Verification

Check whether pods with thread affinity are scheduled to the same NUMA node before and after the deployment.

  1. Run the following command to query the process ID (PID) of the workload. NGINX is used as an example.

    ps -ef | grep nginx
  2. Run the following command to check the CPU affinity of the process, replacing <PID> with the ID obtained in the previous step:

    cat /proc/<PID>/status

Check the Cpus_allowed (or Cpus_allowed_list) field. If the allowed CPUs of the affinity pods all belong to the same NUMA node, CPU pinning for affinity pods has been successfully applied.
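
To relate the allowed CPUs to NUMA nodes, you can, for example, compare them with the NUMA topology reported by lscpu. This is a sketch; replace <PID> with the process IDs obtained above.

lscpu | grep 'NUMA node'                    # CPU ranges of each NUMA node
grep Cpus_allowed_list /proc/<PID>/status   # allowed CPUs of one affinity process

If the Cpus_allowed_list values of the related processes fall within the CPU range of a single NUMA node, the pods share that node.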

FAQ

What can I do if the NUMA affinity policy does not take effect?

  • Symptom

    A workload is configured with the NUMA affinity policy single-numa-node and scheduled with Volcano, but it fails to be scheduled to a node whose topology policy is also single-numa-node.

  • Possible Causes

    1. The numa-aware plug-in is not enabled in the configuration file.
    2. The workload is configured with anti-affinity or taints. As a result, the workload cannot be scheduled to the node.
    3. Node resources are fully utilized.
  • Solution

    Enable the numa-aware plug-in in the Volcano scheduler configuration, remove any conflicting anti-affinity rules or taints, and free up resources on the node.
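
    A quick way to narrow down the cause is, for example, to check the scheduler configuration and the events of the pending pod:

    # Confirm that the numa-aware plug-in is listed in the scheduler configuration.
    kubectl get configmap volcano-scheduler-configmap -n volcano-system -o yaml | grep numa-aware
    # Check the scheduling events of the pending pod for the reported reason.
    kubectl describe pod <pending-pod-name> | grep -A 10 Events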