NPU Operator

Feature Introduction

Kubernetes exposes special hardware resources (such as Ascend NPUs) through the Device Plugin mechanism. However, configuring and managing nodes with such hardware requires multiple software components (drivers, container runtimes, and other libraries) whose installation is complex and error-prone. NPU Operator uses the Kubernetes Operator Framework to automatically manage all software components required to set up Ascend devices. These components include the Ascend driver and firmware and the MindCluster components, such as the device plugin, that support cluster job scheduling, operations monitoring, and fault recovery. With these components installed, the cluster gains NPU resource management, optimized workload scheduling, and containerized support for training and inference tasks, so that AI jobs can be deployed and run as containers on NPU devices.

Table 1 Currently Supported Components

Component Name	Deployment Method	Component Function
Ascend Driver and Firmware	Containerized deployment managed by NPU Operator	Acts as a bridge between hardware devices and the operating system, allowing the operating system to recognize and communicate with hardware devices.
Ascend Device Plugin	Containerized deployment managed by NPU Operator	Device Discovery: Based on the Kubernetes device plugin mechanism, adds device discovery, device allocation, and device health status reporting functions for Ascend AI processors, enabling Kubernetes to manage Ascend AI processor resources.
Ascend Operator	Containerized deployment managed by NPU Operator	Environment Configuration: Volcano coordination component, responsible for managing acjob type tasks, injecting environment variables required by AI frameworks (MindSpore/PyTorch/TensorFlow) training tasks into containers, then Volcano takes over scheduling.
Ascend Docker Runtime	Containerized deployment managed by NPU Operator	Ascend Container Runtime: Container engine plugin, provides NPU containerization support for all AI jobs, enabling users to smoothly run AI jobs as Docker containers on Ascend devices.
NPU Exporter	Containerized deployment managed by NPU Operator	Real-time monitoring of Ascend AI processor resource data: This function supports real-time collection of various resource data from Ascend AI processors, including processor utilization, temperature, voltage, and memory usage. Additionally, it can monitor vNPU of Atlas inference series products, including key metrics such as AI Core utilization, vNPU total memory, and used memory.
Resilience Controller	Containerized deployment managed by NPU Operator	Dynamic Scaling: When a fault occurs during task training and there are insufficient healthy resources for replacement, this component can use dynamic scaling to remove faulty resources and continue training. After sufficient resources become available, training tasks can be resumed through dynamic scaling.
ClusterD	Containerized deployment managed by NPU Operator	Collects cluster task information, resource information, and fault information, uniformly determines fault handling levels and strategies, and controls process recomputation of training containers.
Volcano	Containerized deployment managed by NPU Operator	Obtains cluster resource information from underlying components, selects optimal scheduling strategies and resource allocation by sensing the network connection methods between Ascend chips, and can perform task rescheduling when task resource faults occur.
NodeD	Containerized deployment managed by NPU Operator	Detects node resource monitoring status and node fault information, reports fault information, and prevents new tasks from being scheduled on faulty nodes.
MindIO	Containerized deployment managed by NPU Operator	Handles the generation and saving of terminal CheckPoint after model training interruption, online repair of UCE faults in on-chip memory during model training, provides the ability to complete fault repair and model resumption training through restart or node replacement, and optimizes CheckPoint saving and loading.

For detailed information about components, please refer to MindCluster Introduction.

Component Version Compatibility

Component Name	Version
Ascend Driver and Firmware	25.3.RC1
Ascend Device Plugin	7.2.RC1
Ascend Operator	7.2.RC1
Ascend Docker Runtime	7.2.RC1
NPU Exporter	7.2.RC1
Resilience Controller	7.2.RC1
ClusterD	7.2.RC1
Volcano	7.2.RC1 (based on original Volcano 1.9.0)
NodeD	7.2.RC1
MindIO	7.2.RC1

Application Scenarios

NPU Operator is intended for clusters built on Ascend devices that need cluster job scheduling, operations monitoring, and fault recovery. It automatically identifies Ascend nodes in the cluster and performs the corresponding installation and deployment work. For training scenarios, it supports NPU resource detection, full card scheduling, static vNPU scheduling, resumable training from checkpoints, and elastic training. For inference scenarios, it supports resource detection, full card scheduling, static vNPU scheduling, dynamic vNPU scheduling, inference card fault recovery, and rescheduling.

Capability Scope

Automatically discover Ascend NPU device nodes and label the nodes.
Automatically deploy Ascend NPU driver firmware.
Automatically deploy, install, and manage the lifecycle of the MindCluster cluster scheduling components.

Highlight Features

NPU Operator automatically identifies Ascend nodes and device models in the cluster and installs the matching versions of the components required by the AI runtime, greatly lowering the barrier to configuring Ascend ecosystem components. It provides full lifecycle management and automated configuration and deployment for the installed components. NPU Operator can also detect component installation status and provide detailed logs for debugging.

Implementation Principle

The Operator watches for changes to the CRs instantiated from its CRD and updates the state of the managed components accordingly.
The Operator relies on the labels that NFD places on nodes, and uses the npu-feature-discovery component to add the node labels needed for scheduling Ascend components.

Ensure that application-management-service and marketplace-service are running normally so that this feature can be installed from the application market.

Operator Security Context

Some NPU Operator managed Pods (such as driver containers) require elevated privileges as follows.

privileged: true
hostPID: true
hostIPC: true
hostNetwork: true

Reasons for elevated privileges:

Access host file system and hardware devices, install driver firmware and SDK services on the host machine.
Modify device permissions to adapt for non-root users.

Installation

openFuyao Platform Deployment

Online Installation

Prerequisites

kubectl and Helm CLI are available on the current computer, or there is a configurable application store or repository in the cluster.
Please confirm that the environment contains the bash tool, otherwise the driver firmware installation script parsing may fail.
All worker nodes or node groups that run NPU workloads in the Kubernetes cluster must run openEuler 22.03 LTS or Ubuntu 22.04 (ARM architecture).
For worker nodes or node groups that only run CPU workloads, nodes can run any operating system, because NPU Operator will not perform any configuration or management on nodes with non-NPU workloads.
The components currently installed by NPU Operator support the NPU chip models 910B and 310P. For specific OS and hardware compatibility, please refer to MindCluster Documentation.
Node Feature Discovery (NFD) and NPU Feature Discovery (NPU-Feature-Discovery) are dependencies of Operator on each node.
Note:
By default, NFD master and worker nodes are automatically deployed by Operator. If NFD is already running in the cluster, you must disable NFD deployment when installing Operator. Similarly, if NPU-Feature-Discovery has already been deployed to the cluster, you also need to disable NPU-Feature-Discovery deployment when installing Operator.
values.yaml
yaml
```
nfd:
  enabled: false
npu-feature-discovery:
  enabled: false
```
Check NFD labels on nodes to determine if NFD is already running in the cluster.
sh
```
kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
```
If the command output is true, NFD is already running in the cluster. In this case, set nodefeaturerules to install NPU custom node discovery rules.
yaml
```
nfd:
  nodefeaturerules: true
```
By default, nfd is true, npu-feature-discovery is true, and nodefeaturerules is false.
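The decision described in this note can be scripted. The following is a minimal sketch, assuming node label keys are supplied one per line on standard input; the function name `nfd_helm_flags` is hypothetical and not part of NPU Operator. It only covers the NFD settings and leaves `npu-feature-discovery.enabled` untouched.

```shell
# nfd_helm_flags (hypothetical helper): read node label keys on stdin, one per line,
# and print the Helm flags suggested by the note above.
nfd_helm_flags() {
  if grep -q '^feature\.node\.kubernetes\.io'; then
    # NFD labels found: skip deploying NFD and install only the NPU discovery rules.
    echo "--set nfd.enabled=false --set nfd.nodefeaturerules=true"
  else
    # No NFD labels found: keep the chart defaults (print an empty line).
    echo ""
  fi
}

# Usage against a live cluster (requires kubectl and jq):
#   kubectl get nodes -o json | jq -r '.items[].metadata.labels | keys[]' | nfd_helm_flags
```

The printed flags can then be appended to the `helm install` command, or the same values can be written into values.yaml.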

Installation Steps

NPU Operator extension component can be downloaded and installed from the openFuyao application market.

Enter the openFuyao platform and select "Application Market > Application List" from the left navigation bar.
Search for "npu-operator" in the application list to find the NPU Operator extension component.
Click the NPU Operator card to enter the application details page.
On the details page, click "Deploy" in the upper right corner, and enter "Application Name", "Version Information", and "Namespace" in the "Installation Information" module on the deployment interface.
Click "Confirm" to successfully deploy the component.
Note:
Currently, the online installation function supports driver firmware installation for 910B and 310P series models. Online installation of 910C model driver firmware is not currently supported. If you need to install 910C model driver firmware, please refer to the Offline Installation section. When installing and deploying the npu-operator component through the application market, you can modify the corresponding values.yaml parameters. For details, please refer to Table 2.

Offline Installation

Prerequisites

Please refer to Online Installation Prerequisites.
Download offline images: Download all images used by the components to be installed locally.
Prepare driver firmware zip package and MindIO component zip package:
- Download driver firmware zip package: Go to npu-driver-installer's repository, find the config.json file for the corresponding driver firmware version, and click the corresponding link according to the NPU model and OS architecture of the corresponding node to download the corresponding driver firmware zip package; 910C model NPU supports offline driver firmware installation, which needs to be downloaded separately from Ascend Community - Firmware and Driver Download. Driver firmware zip packages for other models can also be downloaded from the above link.
- Download MindIO component zip package: Go to npu-node-provision's repository, find the config.json file for the corresponding component version, and click the corresponding link according to the NPU model and OS architecture of the corresponding node to download the corresponding SDK zip package.
Place the driver firmware zip file in the node path for offline installation: /tmp/driver_pkg/. This path can be customized. For specific modification methods, please refer to the driver.env field in Table 2.
Place the MindIO component zip package in the node path for offline installation: /opt/openFuyao/mindio/
Check if the following tools exist in the nodes to be installed:
- If the package management tool is yum, the packages to be installed are: "jq wget unzip which initscripts coreutils findutils gawk e2fsprogs util-linux net-tools pciutils gcc make automake autoconf libtool git patch kernel-devel-$(uname -r) dkms"
- If the package management tool is apt-get, the packages to be installed are: "jq wget unzip debianutils coreutils findutils gawk e2fsprogs util-linux net-tools pciutils gcc make automake autoconf libtool git patch dkms linux-headers-$(uname -r)"
- If the package management tool is dnf, the packages to be installed are: "jq wget unzip which initscripts coreutils findutils gawk e2fsprogs util-linux net-tools pciutils gcc make automake autoconf libtool git patch kernel-devel-$(uname -r) dkms"
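The offline prerequisites above can be partly checked with a short script before installation. This is a minimal sketch: the helper names `missing_tools` and `has_zip` are hypothetical, and the tool check only covers packages that install a command of the same name. Packages such as kernel headers or coreutils still need to be verified with the node's package manager.

```shell
# missing_tools (hypothetical helper): print every given command name that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# has_zip (hypothetical helper): succeed if the directory contains at least one .zip file.
has_zip() {
  ls "$1"/*.zip >/dev/null 2>&1
}

# Usage on a node prepared for offline installation:
#   missing_tools jq wget unzip gawk gcc make automake autoconf git patch dkms
#   has_zip /tmp/driver_pkg || echo "no driver firmware zip package found"
#   has_zip /opt/openFuyao/mindio || echo "no MindIO zip package found"
```

An empty result from `missing_tools` means every listed command was found. The paths shown are the defaults named above and should be adjusted if `HOST_DRIVER_SOURCE_PATH` has been customized.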

Installation Steps

Please refer to Online Installation Steps.

Standalone Deployment

Online Installation

Prerequisites

Please refer to openFuyao Platform Deployment Prerequisites.

Installation Steps

Add the openFuyao Helm repository.

shell

helm repo add openfuyao https://helm.openfuyao.cn && helm repo update

Install NPU Operator.

Install Operator using default configuration:

shell

helm install --wait --generate-name \
  -n default --create-namespace \
  openfuyao/npu-operator

If installing via Helm application store or application market, add the https://helm.openfuyao.cn repository and install through the interface. For details, please refer to Common Customization Options.

Note:
After installing NPU Operator, labels related to NPU resources will be added to nodes according to different node environments. These labels are related to cluster scheduling components. The accelerate-type label requires the node's hardware server to fully match the NPU card. For specific matching relationships, please refer to MindCluster Documentation's Creating Node Labels.
For A800I A2 inference servers, automatic addition of server-usage=infer label is not currently supported. Users need to manually add it using the following command.

bash

kubectl label nodes <node-name> server-usage=infer
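
When several inference servers need the label, the command above can be wrapped in a loop. This is a minimal sketch; the function name `label_infer_nodes` is hypothetical, and the node names are whatever `kubectl get nodes` reports in your cluster.

```shell
# label_infer_nodes (hypothetical helper): apply server-usage=infer to each named node.
label_infer_nodes() {
  for node in "$@"; do
    # --overwrite keeps the call idempotent if the label already exists.
    kubectl label nodes "$node" server-usage=infer --overwrite || return 1
  done
}

# Usage: label_infer_nodes <node-name-1> <node-name-2>
# Check the result with: kubectl get nodes -L server-usage
```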

Common Customization Options

When using Helm Chart, the following options can be modified. These options are used during Helm installation through --set or --set-json (used when modifying component environment variables, volumes, and other list structures). Due to the large number of configuration items, it is recommended to directly modify the corresponding fields in the values.yaml file in the chart package.

Table 2 lists the most commonly used fields. For other fields, please refer to the repository.

Table 2 Common Options

Parameter	Description	Default
`nfd.enabled`	Deploy Node Feature Discovery service NFD. If NFD is already running in the cluster, set this variable to `false`. Note: If set to `true` during installation, try not to modify this field to `false` during use, otherwise NFD residual labels will remain after uninstallation.	`true`
`nfd.nodefeaturerules`	When set to `true`, install NFD discovered NPU device rules CR.	`true`
`node-feature-discovery.image.repository`	NFD service image address.	`registry.k8s.io/nfd/node-feature-discovery`
`node-feature-discovery.image.pullPolicy`	NFD service image pull policy.	`Always`
`node-feature-discovery.image.tag`	NFD service image version.	`v0.16.4`
`npu-feature-discovery.images.core.repository`	NPU-Feature-Discovery image address.	`cr.openfuyao.cn/openfuyao/npu-feature-discovery`
`npu-feature-discovery.images.core.pullPolicy`	NPU-Feature-Discovery image pull policy.	`Always`
`npu-feature-discovery.images.core.tag`	NPU-Feature-Discovery image version.	`latest`
`npu-feature-discovery.enabled`	Switch for deploying NPU-Feature-Discovery. If NPU-Feature-Discovery is already running in the cluster, set this variable to false.	`true`
`images.operator.repository`	NPU Operator image address.	`cr.openfuyao.cn/openfuyao/npu-operator`
`images.operator.tag`	NPU Operator image version.	`latest`
`images.operator.pullPolicy`	NPU Operator image pull policy.	`Always`
`daemonSets.labels`	Custom labels to add to all NPU Operator managed Pods.	`{}`
`daemonSets.tolerations`	Custom tolerations to add to all NPU Operator managed Pods.	`[]`
`driver.enabled`	By default, Operator deploys NPU driver firmware program as a container on the system.	`true`
`images.driver.repository`	Driver firmware program image storage address.	`cr.openfuyao.cn/openfuyao/npu-driver-installer`
`images.driver.tag`	Driver firmware installation service image version.	`latest`
`driver.env`	Environment variables for the driver firmware installation service. `HOST_DRIVER_SOURCE_PATH` specifies the node path where the driver firmware zip package is placed. `DRIVER_VERSION` specifies the driver firmware version to download during online installation.	1. name: `HOST_DRIVER_SOURCE_PATH`, value: `"/tmp/driver_pkg"` 2. name: `DRIVER_VERSION`, value: `"25.3.RC1"`
`devicePlugin.enabled`	By default, Operator deploys NPU device plugin program on the system. When using Operator on a system with pre-installed device plugins, please set this value to `false`. Note: When modifying this field to `false`, other simultaneously set fields will not take effect.	`true`
`images.devicePlugin.repository`	Device plugin program image address.	`hub.oepkgs.net/openfuyao/ascendhub/ascend-k8sdeviceplugin`
`images.devicePlugin.tag`	Device plugin service image version.	`v7.2.RC1`
`trainer.enabled`	By default, Operator installs Ascend operator. If installation is not needed, set this value to `false`.	`true`
`images.trainer.repository`	Ascend operator image address.	`hub.oepkgs.net/openfuyao/ascendhub/ascend-operator`
`images.trainer.tag`	Ascend operator image version.	`v7.2.RC1`
`ociRuntime.enabled`	By default, Operator installs Ascend Docker Runtime. If installation is not needed, set this value to `false`.	`true`
`images.ociRuntime.repository`	Ascend Docker Runtime image address.	`cr.openfuyao.cn/openfuyao/npu-container-toolkit`
`images.ociRuntime.tag`	Ascend Docker Runtime image version.	`latest`
`nodeD.enabled`	By default, Operator installs nodeD. If installation is not needed, set this value to `false`.	`true`
`images.nodeD.repository`	nodeD image address.	`hub.oepkgs.net/openfuyao/ascendhub/noded`
`images.nodeD.tag`	nodeD image version.	`v7.2.RC1`
`clusterd.enabled`	By default, Operator installs clusterD component. If installation is not needed, set this value to `false`.	`true`
`images.clusterd.repository`	clusterD image address.	`hub.oepkgs.net/openfuyao/ascendhub/clusterd`
`images.clusterd.tag`	clusterD image version.	`v7.2.RC1`
`rscontroller.enabled`	By default, Operator installs resilience controller component. If installation is not needed, set this value to `false`.	`true`
`images.rscontroller.repository`	resilience controller image address.	`hub.oepkgs.net/openfuyao/ascendhub/resilience-controller`
`images.rscontroller.tag`	resilience controller image version.	`v7.1.RC1`
`exporter.enabled`	By default, Operator installs NPU Exporter component. If installation is not needed, set this value to `false`.	`true`
`images.exporter.repository`	NPU Exporter image address.	`cr.openfuyao.cn/openfuyao/npu-exporter`
`images.exporter.tag`	NPU Exporter image version.	`v7.2.RC1-of.1`
`mindiotft.enabled`	By default, Operator installs MindIO Training Fault Tolerance. If installation is not needed, set this value to `false`.	`true`
`images.mindiotft.repository`	MindIO Training Fault Tolerance image address.	`cr.openfuyao.cn/openfuyao/npu-node-provision`
`images.mindiotft.tag`	MindIO Training Fault Tolerance image version.	`latest`
`mindioacp.enabled`	By default, Operator installs MindIO Async Checkpoint Persistence. If installation is not needed, set this value to `false`.	`true`
`images.mindioacp.repository`	MindIO Async Checkpoint Persistence service image address.	`cr.openfuyao.cn/openfuyao/npu-node-provision`
`images.mindioacp.tag`	MindIO Async Checkpoint Persistence service image version.	`latest`
`mindioacp.version`	MindIO Async Checkpoint Persistence version.	`7.2.RC1`
`vccontroller.enabled`	By default, Operator installs volcano-controller. If installation is not needed, set this value to `false`.	`true`
`images.vccontroller.repository`	volcano-controller service image address.	`hub.oepkgs.net/openfuyao/ascendhub/vc-controller-manager`
`images.vccontroller.tag`	volcano-controller service image version.	`v1.9.0-v7.2.RC1`
`vcscheduler.enabled`	By default, Operator installs volcano-scheduler. If installation is not needed, set this value to `false`.	`true`
`images.vcscheduler.repository`	volcano-scheduler service image address.	`hub.oepkgs.net/openfuyao/ascendhub/vc-scheduler`
`images.vcscheduler.tag`	volcano-scheduler service image version.	`v1.9.0-v7.2.RC1`
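
As an illustration of how the options in Table 2 are passed on the command line, the sketch below wraps `helm install` with two `--set` overrides and one `--set-json` override. The wrapper name `install_npu_operator` and the taint key `example.com/npu` are hypothetical, the chosen overrides are examples rather than recommended settings, and `--set-json` assumes Helm 3.10 or later.

```shell
# install_npu_operator (hypothetical wrapper): install the chart with a few Table 2 overrides.
install_npu_operator() {
  # driver.enabled=false: skip the containerized driver firmware deployment.
  # images.operator.pullPolicy: change the Operator image pull policy.
  # daemonSets.tolerations: list value, so it is passed as JSON (example taint key).
  helm install --wait --generate-name \
    -n default --create-namespace \
    --set driver.enabled=false \
    --set images.operator.pullPolicy=IfNotPresent \
    --set-json 'daemonSets.tolerations=[{"key":"example.com/npu","operator":"Exists","effect":"NoSchedule"}]' \
    openfuyao/npu-operator
}

# Usage: install_npu_operator
```

For more than a handful of overrides, editing values.yaml is usually easier to review than a long list of flags.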

Offline Installation

Prerequisites

Please refer to openFuyao Platform Deployment Prerequisites.

Installation Steps

Prepare the NPU Operator chart package in advance.
Install Operator using default configuration:
shell
```
cd npu-operator/charts
helm install <npu-operator release name> npu-operator
```
For common customization options, see Table 2.

Upgrade

NPU Operator supports dynamic updates to existing resources. This feature enables NPU Operator to ensure that NPU Policy settings in the cluster are always up to date.

Since Helm does not support automatic upgrades of existing CRDs, you can upgrade the NPU Operator Chart either manually or by enabling Helm Hooks.

NPU Policy CR Update

NPU Operator supports dynamic updates to npuclusterpolicy CustomResource using kubectl.

shell

kubectl get npuclusterpolicy -A
# If the default npuclusterpolicy has not been modified, the default name of npuclusterpolicy is cluster.
kubectl edit npuclusterpolicy cluster

After editing, Kubernetes automatically applies the updates to the cluster. All components managed by NPU Operator will also be updated to the expected state.
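
For scripted changes, the same update can be made without an interactive editor by using `kubectl patch`. This is a minimal sketch: the helper name `set_component_managed` is hypothetical, and it assumes the policy is named cluster and that the `managed` field under each component in `spec` controls whether the Operator manages that component.

```shell
# set_component_managed (hypothetical helper): toggle spec.<component>.managed on the policy.
# $1 = component key under spec (for example exporter), $2 = true or false.
set_component_managed() {
  kubectl patch npuclusterpolicy cluster --type merge \
    -p "{\"spec\":{\"$1\":{\"managed\":$2}}}"
}

# Usage: set_component_managed exporter false
```

A merge patch changes only the named field, so the rest of the policy is left as it is.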

Installation Status Verification

View Component Status via CustomResource

View component status through CustomResource npuclusterpolicies.npu.openfuyao.com. The specific method is to check the state field of each component in the status field to confirm the current status of the component. Below is an example of the driver installer running normally.

yaml

status:
  componentStatuses:
    - name: /var/lib/npu-operator/components/driver
      prevState:
        reason: Reconciling
        type: deploying
      state:
        reason: Reconciled
        type: running
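
To read the state of every component at a glance, the status fields shown above can be extracted with a JSONPath query. This is a minimal sketch, assuming the policy is named cluster; the helper names `component_states` and `not_running` are hypothetical.

```shell
# component_states (hypothetical helper): print "<component path><TAB><state type>" per component.
component_states() {
  kubectl get npuclusterpolicies.npu.openfuyao.com cluster \
    -o jsonpath='{range .status.componentStatuses[*]}{.name}{"\t"}{.state.type}{"\n"}{end}'
}

# not_running (hypothetical helper): keep only lines whose state type is not "running".
not_running() {
  awk -F '\t' '$2 != "running" { print $1 ": " $2 }'
}

# Usage: component_states | not_running
```

Components that are intentionally disabled will also appear in the filtered output, since their state type is unmanaged rather than running.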

View CustomResource

bash

$ kubectl get npuclusterpolicies.npu.openfuyao.com cluster -o yaml
apiVersion: npu.openfuyao.com/v1
kind: NPUClusterPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: npu
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2025-03-11T13:22:39Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: npu-operator
  name: cluster
  resourceVersion: "2240086"
  uid: 0d1498c5-143a-4e05-a5dc-376d2e6c96ea
spec:
  clusterd:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/clusterd
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/clusterd/clusterd.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  daemonsets:
    imageSpec:
      imagePullPolicy: IfNotPresent
      imagePullSecrets: []
    labels:
      app.kubernetes.io/managed-by: npu-operator
      helm.sh/chart: npu-operator-0.0.0-latest
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Equal
      value: ""
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
      operator: Equal
      value: ""
  devicePlugin:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/ascend-k8sdeviceplugin
      tag: v6.0.0
    initImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: hub.oepkgs.net/busybox:latest
      tag: ""
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/devicePlugin/devicePlugin.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  driver:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-driver-installer
      tag: latest
    initImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: cr.openfuyao.cn/openfuyao/npu-driver-installer:latest
      tag: ""
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/driver/driver.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
    version: 24.1.RC3
  exporter:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/npu-exporter
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/npu-exporter/npu-exporter.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  mindioacp:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-node-provision
      tag: latest
    managed: false
    version: 6.0.0
  mindiotft:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-node-provision
      tag: latest
    managed: false
  nodeD:
    heartbeatInterval: 5
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/noded
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/noded/noded.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
    pollInterval: 60
  ociRuntime:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-container-toolkit
      tag: latest
    initConfigImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: cr.openfuyao.cn/openfuyao/npu-container-toolkit:latest
      tag: ""
    initRuntimeImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: cr.openfuyao.cn/openfuyao/ascend-image/ascend-docker-runtime:latest
      tag: ""
    interval: 300
    managed: true
  operator:
    imageSpec:
      imagePullPolicy: IfNotPresent
      imagePullSecrets: []
    runtimeClass: ascend
  rscontroller:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/resilience-controller
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/resilience-controller/run.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  trainer:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/ascend-operator
      tag: v6.0.0
    initImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: hub.oepkgs.net/library/busybox:latest
      tag: ""
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/ascend-operator/ascend-operator.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  vccontroller:
    controllerResources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 1000m
        memory: 1Gi
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/vc-controller-manager
      tag: v1.9.0-v6.0.0
    managed: true
  vcscheduler:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/vc-scheduler
      tag: v1.9.0-v6.0.0
    managed: true
    schedulerResources:
      limits:
        cpu: 200m
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 1Gi
status:
  componentStatuses:
  - name: /var/lib/npu-operator/components/driver
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/oci-runtime
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/device-plugin
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/trainer
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/noded
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/volcano/volcano-controller
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/volcano/volcano-scheduler
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/clusterd
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/resilience-controller
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/npu-exporter
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/mindio/mindiotft
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: ComponentUnmanaged
      type: unmanaged
  - name: /var/lib/npu-operator/components/mindio/mindioacp
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: ComponentUnmanaged
      type: unmanaged
  conditions:
  - lastTransitionTime: "2025-03-11T13:25:41Z"
    message: ""
    reason: Ready
    status: "False"
    type: Error
  - lastTransitionTime: "2025-03-11T13:25:41Z"
    message: all components have been successfully reconciled
    reason: Reconciled
    status: "True"
    type: Ready
  namespace: default
  phase: Ready
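
In automation, the `phase` field at the end of the status can be polled until the policy reports Ready. This is a minimal sketch, assuming the policy is named cluster; the helper name `wait_policy_ready` and its default attempt count and delay are hypothetical choices.

```shell
# wait_policy_ready (hypothetical helper): poll status.phase until it is Ready or attempts run out.
# $1 = number of attempts (default 30), $2 = seconds between attempts (default 10).
wait_policy_ready() {
  attempts=${1:-30}
  delay=${2:-10}
  while [ "$attempts" -gt 0 ]; do
    phase=$(kubectl get npuclusterpolicies.npu.openfuyao.com cluster -o jsonpath='{.status.phase}')
    # Stop as soon as the Operator reports that all components are reconciled.
    [ "$phase" = "Ready" ] && return 0
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Usage: wait_policy_ready 60 5 || echo "NPUClusterPolicy did not become Ready in time"
```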

Manually Verify Installation Status and Running Results of Each Component

Driver Installation Status Verification

To check the driver firmware installation status, use a command like npu-smi info. If the output is similar to the following, it indicates the driver has been installed.

shell


+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 99.1        55                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3162 / 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 91.7        53                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3162 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 98.2        51                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3162 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 93.2        49                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3162 / 65536         |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 98.8        55                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          3163 / 65536         |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 96.2        56                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          3163 / 65536         |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 96.9        53                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3162 / 65536         |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 97.6        55                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          3163 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
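
The manual check above can also be scripted. The following is a minimal sketch, assuming the table layout shown above (npu-smi 24.1.rc2); other versions may format the output differently. The function name unhealthy_npus is hypothetical and is not part of npu-smi or NPU Operator.

```shell
#!/usr/bin/env bash
# Sketch: list NPUs whose Health column is not "OK".
# Assumption: the `npu-smi info` table layout matches the sample output above.

# Reads `npu-smi info` output on stdin and prints "<npu-id> <health>"
# for every NPU row whose Health value differs from "OK".
unhealthy_npus() {
  awk -F'|' '
    # The process table starts at the "NPU  Chip" header; stop before it.
    $2 ~ /NPU +Chip/ { exit }
    NF >= 4 {
      # NPU rows have two words in the first cell: "<id> <name>".
      n = split($2, cell, " ")
      health = $3
      gsub(/ /, "", health)
      if (n == 2 && cell[1] ~ /^[0-9]+$/ && health != "OK") {
        print cell[1], health
      }
    }
  '
}

# Usage on a node where the driver is installed:
#   npu-smi info | unhealthy_npus
```

Empty output means every NPU listed in the table reports OK.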

MindCluster Component Installation Status Verification

Run kubectl get pod -A to list all Pods. If every Pod is in the Running state, the components have started successfully. For more detailed verification of each component's functions, see the official MindCluster documentation.

bash


NAMESPACE                NAME                                                      READY   STATUS    RESTARTS         AGE
default                  ascend-runtime-containerd-7lg85                           1/1     Running   0                6m31s
default                  npu-driver-c4744                                          1/1     Running   0                6m31s
default                  npu-operator-77f56c9f6c-fhx8m                             1/1     Running   0                6m32s
default                  npu-feature-discovery-zqgt9                               1/1     Running   0                7m12s
default                  mindio-acp-43f64g63d2v                                    1/1     Running   0                7m21s
default                  mindio-tft-2cc35gs3c2u                                    1/1     Running   0                6m32s
kube-system              ascend-device-plugin-fm4h9                                1/1     Running   0                6m35s
mindx-dl                 ascend-operator-manager-6ff7468bd9-47d7s                  1/1     Running   0                6m50s
mindx-dl                 clusterd-5ffb8f6787-n5m82                                 1/1     Running   0                6m48s
mindx-dl                 noded-kmv8d                                               1/1     Running   0                7m11s
mindx-dl                 resilience-controller-6727f36c28-wjn3s                    1/1     Running   0                7m20s
npu-exporter             npu-exporter-b6txl                                        1/1     Running   0                7m22s
volcano-system           volcano-controllers-373749bg23c-mc9cq                     1/1     Running   0                7m31s
volcano-system           volcano-scheduler-d585db88f-nkxch                         1/1     Running   0                7m40s
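
On a cluster with many Pods, it can be easier to print only the Pods that are not Running. The following is a minimal sketch that filters the kubectl get pod -A listing by its STATUS column. The function name pods_not_running is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print Pods whose STATUS column is not "Running".
# Assumption: input is `kubectl get pod -A` output, whose columns are
# NAMESPACE NAME READY STATUS RESTARTS AGE.

# Reads the Pod listing on stdin and prints "<namespace>/<name> <status>".
pods_not_running() {
  awk '
    $1 == "NAMESPACE" { next }   # skip the header row if present
    $4 != "Running"   { print $1 "/" $2, $4 }
  '
}

# Usage:
#   kubectl get pod -A --no-headers | pods_not_running
```

Empty output means all Pods are Running. Pods that belong to finished jobs (for example, in the Completed state) are also printed, so review the list rather than treating every entry as a failure.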

NOTE
Ascend Docker Runtime is installed as a plugin and registered with containerd. To use it, specify ascend-docker-runtime as the runtime when starting a container, or set runtimeClassName to ascend when creating Kubernetes resources. Example:
ctr run --runtime io.containerd.runc.v2 --runc-binary /var/lib/npu-container-toolkit/runtime/ascend-docker-runtime -t \
--env ASCEND_VISIBLE_DEVICES=0 ubuntu:22.04 <container_id>
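
For the Kubernetes path mentioned in the note, the following is a minimal sketch of a Pod that sets runtimeClassName to ascend. It assumes a RuntimeClass named ascend exists in the cluster, as the note implies. The Pod name, container name, and sleep command are hypothetical, and the environment variable only mirrors the ctr example above. In a scheduled workload, devices are normally requested through the device plugin instead, and the resource names for that are not given in this section.

```shell
#!/usr/bin/env bash
# Sketch: print a Pod manifest that uses the "ascend" runtime class.
# Assumptions: Pod and container names are placeholders chosen for illustration.

print_runtime_test_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: npu-runtime-test        # hypothetical name
spec:
  runtimeClassName: ascend      # runtime class named in the note above
  restartPolicy: Never
  containers:
    - name: main                # hypothetical name
      image: ubuntu:22.04
      command: ["sleep", "3600"]
      env:
        - name: ASCEND_VISIBLE_DEVICES
          value: "0"
EOF
}

# Usage:
#   print_runtime_test_pod | kubectl apply -f -
```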

Uninstallation

To uninstall NPU Operator, perform the following steps.

Delete the NPU Operator release with the Helm CLI, as shown below, or remove it through the application management interface.
shell
```
helm delete <npu-operator release name>
```

By default, Helm does not delete existing CRDs when a chart is uninstalled. Run the following command to check whether the CRD still exists.

shell

kubectl get crd npuclusterpolicies.npu.openfuyao.com

Run the following command to delete the CRD manually.

shell

kubectl delete crd npuclusterpolicies.npu.openfuyao.com

NOTE
After NPU Operator is uninstalled, the driver may remain on the host machine.

Component Installation and Uninstallation Instructions

Component Installation and Uninstallation Fields

When NPU Operator is installed for the first time and a component's enabled field in values.yaml is set to true, NPU Operator deploys and manages that component, replacing any resources of the component that already exist in the cluster.
If a component already exists in the cluster (for example, volcano-controller) and its enabled field in values.yaml is set to false during the first installation, NPU Operator does not delete the existing component resources.
After NPU Operator is installed, you can change a component's image address, resource configuration, and lifecycle management settings by modifying the corresponding fields of the CR instance.

MindIO Installation Dependencies

To install the MindIO components, Python and the pip3 tool must be installed on the node in advance. Python versions 3.7 through 3.11 are supported; with any other version, the installation fails. After installation, mount the corresponding SDK into the training container to use it.
The MindIO TFT (Training Fault Tolerance) component is installed in /opt/sdk/tft. whl packages for different Python versions are provided in /opt/tft-whl-package for custom installations. For usage, see Fault Recovery Acceleration.
The MindIO ACP (Async Checkpoint Persistence) component is installed in /opt/mindio and /opt/sdk/acp. whl packages for different Python versions are provided in /opt/acp-whl-package so that users can install the one they need. For usage, see Checkpoint Save and Load Optimization.
Uninstalling the MindIO components clears their SDK folders, which may cause exceptions in task containers that are using them. Proceed with caution.
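
The Python prerequisite above can be checked on a node before the MindIO components are enabled. The following is a minimal sketch. The function names are hypothetical, and the supported range 3.7 to 3.11 is taken from the text above.

```shell
#!/usr/bin/env bash
# Sketch: check the MindIO Python prerequisite (Python 3.7-3.11 plus pip3).

# Returns 0 if the given "major.minor[.patch]" version is within 3.7-3.11.
python_version_supported() {
  local major minor
  major=${1%%.*}        # text before the first dot
  minor=${1#*.}         # text after the first dot
  minor=${minor%%.*}    # drop the patch part, if any
  [ "$major" = "3" ] && [ "$minor" -ge 7 ] && [ "$minor" -le 11 ]
}

# Checks the python3 and pip3 found on PATH and prints the result.
check_mindio_python_prereqs() {
  local ver
  command -v python3 >/dev/null 2>&1 || { echo "python3 not found"; return 1; }
  command -v pip3 >/dev/null 2>&1 || { echo "pip3 not found"; return 1; }
  ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
  if python_version_supported "$ver"; then
    echo "python $ver is supported"
  else
    echo "python $ver is outside the supported range 3.7-3.11"
    return 1
  fi
}

# Usage on the node:
#   check_mindio_python_prereqs
```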

Helm chart values.yaml Special Field Instructions

driver.env: adds container environment variables for npu-driver-installer. HOST_DRIVER_SOURCE_PATH is the path where the driver and firmware zip package must be placed for offline installation; the default is "/tmp/driver_pkg". DRIVER_VERSION is the driver and firmware version number; the default is "25.3.RC1".
trainer.commandSpec: taking the ascend-operator component as an example, the commandSpec field configures the container startup command for the component. It can be modified to set, for example, the log level, the log path, and component startup parameters. For details, see the parameter descriptions of each component in the MindCluster documentation. The ascend-device-plugin, npu-exporter, volcano, clusterd, noded, and other components also contain this field and accept their own startup parameters.
trainer.resources: taking the ascend-operator component as an example, the resources field configures the resource requests of the component container. Unless you have special requirements, keep the default configuration; adjust it as needed according to your workloads and cluster resources. The ascend-device-plugin, npu-exporter, volcano, clusterd, noded, and other components also contain this field.
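
As an illustration of the driver.env field, the following is a minimal sketch that prints a values override file. It assumes that driver.env uses the standard Kubernetes container env list form (name/value pairs); this section does not show the exact schema, so check the chart's values.yaml before use. The values shown are the defaults stated above. The function name and the file name driver-overrides.yaml are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print a Helm values override for the driver installer environment.
# Assumption: driver.env is a list of name/value pairs; verify against values.yaml.

print_driver_overrides() {
  cat <<'EOF'
driver:
  env:
    - name: HOST_DRIVER_SOURCE_PATH   # directory holding the offline driver/firmware zip
      value: "/tmp/driver_pkg"
    - name: DRIVER_VERSION            # driver and firmware version
      value: "25.3.RC1"
EOF
}

# Usage (release name and chart reference are placeholders):
#   print_driver_overrides > driver-overrides.yaml
#   helm upgrade <npu-operator release name> <chart> -f driver-overrides.yaml
```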
