NPU Operator
Feature Introduction
Kubernetes exposes special hardware resources (such as Ascend NPUs) through the Device Plugin mechanism. However, configuring and managing nodes with this hardware requires multiple software components (such as drivers, container runtimes, and other libraries), which are complex and error-prone to install. NPU Operator uses the Kubernetes Operator Framework to automatically manage all software components required for configuring Ascend devices. These components include the Ascend driver and firmware, and the MindCluster components that provide device plugins, cluster job scheduling, operations monitoring, and fault recovery. Installing them enables NPU resource management, optimized workload scheduling, and containerized training and inference, so that AI jobs can be deployed and run as containers on NPU devices.
Table 1 Currently Supported Components
| Component Name | Deployment Method | Component Function |
|---|---|---|
| Ascend Driver and Firmware | Containerized deployment managed by NPU Operator | Acts as a bridge between hardware devices and the operating system, allowing the operating system to recognize and communicate with hardware devices. |
| Ascend Device Plugin | Containerized deployment managed by NPU Operator | Device Discovery: Based on the Kubernetes device plugin mechanism, adds device discovery, device allocation, and device health status reporting functions for Ascend AI processors, enabling Kubernetes to manage Ascend AI processor resources. |
| Ascend Operator | Containerized deployment managed by NPU Operator | Environment Configuration: Volcano coordination component, responsible for managing acjob type tasks, injecting environment variables required by AI frameworks (MindSpore/PyTorch/TensorFlow) training tasks into containers, then Volcano takes over scheduling. |
| Ascend Docker Runtime | Containerized deployment managed by NPU Operator | Ascend Container Runtime: Container engine plugin, provides NPU containerization support for all AI jobs, enabling users to smoothly run AI jobs as Docker containers on Ascend devices. |
| NPU Exporter | Containerized deployment managed by NPU Operator | Real-time monitoring of Ascend AI processor resource data: This function supports real-time collection of various resource data from Ascend AI processors, including processor utilization, temperature, voltage, and memory usage. Additionally, it can monitor vNPU of Atlas inference series products, including key metrics such as AI Core utilization, vNPU total memory, and used memory. |
| Resilience Controller | Containerized deployment managed by NPU Operator | Dynamic Scaling: When a fault occurs during task training and there are insufficient healthy resources for replacement, this component can use dynamic scaling to remove faulty resources and continue training. After sufficient resources become available, training tasks can be resumed through dynamic scaling. |
| ClusterD | Containerized deployment managed by NPU Operator | Collects cluster task information, resource information, and fault information, uniformly determines fault handling levels and strategies, and controls process recomputation of training containers. |
| Volcano | Containerized deployment managed by NPU Operator | Obtains cluster resource information from underlying components, selects optimal scheduling strategies and resource allocation by sensing the network connection methods between Ascend chips, and can perform task rescheduling when task resource faults occur. |
| NodeD | Containerized deployment managed by NPU Operator | Detects node resource monitoring status and node fault information, reports fault information, and prevents new tasks from being scheduled on faulty nodes. |
| MindIO | Containerized deployment managed by NPU Operator | Handles the generation and saving of terminal CheckPoint after model training interruption, online repair of UCE faults in on-chip memory during model training, provides the ability to complete fault repair and model resumption training through restart or node replacement, and optimizes CheckPoint saving and loading. |
For detailed information about components, please refer to MindCluster Introduction.
Component Version Compatibility
| Component Name | Version |
|---|---|
| Ascend Driver and Firmware | 25.3.RC1 |
| Ascend Device Plugin | 7.2.RC1 |
| Ascend Operator | 7.2.RC1 |
| Ascend Docker Runtime | 7.2.RC1 |
| NPU Exporter | 7.2.RC1 |
| Resilience Controller | 7.2.RC1 |
| ClusterD | 7.2.RC1 |
| Volcano | 7.2.RC1 (based on original Volcano 1.9.0) |
| NodeD | 7.2.RC1 |
| MindIO | 7.2.RC1 |
Application Scenarios
Building clusters based on Ascend devices, supporting cluster job scheduling, operations monitoring, and fault recovery scenarios. NPU Operator can automatically identify Ascend nodes in the cluster and perform corresponding installation and deployment work. For training scenarios, it supports NPU resource detection, full card scheduling, static vNPU scheduling, resuming training from breakpoint, and elastic training. For inference scenarios, it supports resource detection, full card scheduling, static vNPU scheduling, dynamic vNPU scheduling, inference card fault recovery, and rescheduling functions.
Capability Scope
- Automatically discover Ascend NPU device nodes and label the nodes.
- Automatically deploy Ascend NPU driver firmware.
- Automatically deploy, install, and manage the lifecycle of MindCluster cluster scheduling components.
Highlight Features
NPU Operator can automatically identify Ascend nodes and device models in the cluster and install the matching versions of the components required by the AI runtime, greatly lowering the barrier to configuring Ascend ecosystem components. It provides full lifecycle management and automated configuration and deployment for installed components. NPU Operator can also detect component installation status and provide detailed logs for debugging.
Implementation Principle
- Operator watches for changes to the CRs instantiated from its CRD and updates the state of the managed components accordingly.
- Operator uses the labels that NFD applies to nodes, and uses the npu-feature-discovery component to add labels suitable for scheduling Ascend components.
Relationship with Related Features
Ensure that application-management-service and marketplace-service are running normally so that this feature can be installed from the application market.
Operator Security Context
Some NPU Operator managed Pods (such as driver containers) require elevated privileges as follows.
- privileged: true
- hostPID: true
- hostIPC: true
- hostNetwork: true
Reasons for elevated privileges:
- Access host file system and hardware devices, install driver firmware and SDK services on the host machine.
- Modify device permissions to adapt for non-root users.
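To see which Pods on a cluster actually run with host-level access, one option is to list the host namespace flags per Pod and filter for any that are set. The sketch below is an illustration only: the helper name `host_access_pods` is hypothetical, and it checks only the Pod-level hostPID, hostIPC, and hostNetwork fields (privileged is set per container and is not covered).

```shell
#!/bin/sh
# Hypothetical helper: reads tab-separated lines of
#   namespace<TAB>pod<TAB>hostPID<TAB>hostIPC<TAB>hostNetwork
# and prints the Pods where at least one host flag is "true".
host_access_pods() {
  awk -F '\t' '$3 == "true" || $4 == "true" || $5 == "true" { print $1 "/" $2 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.hostPID}{"\t"}{.spec.hostIPC}{"\t"}{.spec.hostNetwork}{"\n"}{end}' | host_access_pods
```

Fields that are unset in the Pod spec come through as empty strings, so only Pods that explicitly enable a host namespace are reported.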
Installation
openFuyao Platform Deployment
Online Installation
Prerequisites
kubectl and Helm CLI are available on the current computer, or there is a configurable application store or repository in the cluster.
Please confirm that the environment contains the bash tool, otherwise the driver firmware installation script parsing may fail.
All worker nodes or node groups running NPU workloads in the Kubernetes cluster must run openEuler 22.03 LTS or Ubuntu 22.04 (ARM architecture).
For worker nodes or node groups that only run CPU workloads, nodes can run any operating system, because NPU Operator will not perform any configuration or management on nodes with non-NPU workloads.
The components installed by the current NPU Operator require the running environment to use NPU chip model 910B or 310P. For specific OS and hardware compatibility, please refer to MindCluster Documentation.
Node Feature Discovery (NFD) and NPU Feature Discovery (NPU-Feature-Discovery) are dependencies of Operator on each node.
Note:
By default, NFD master and worker nodes are automatically deployed by Operator. If NFD is already running in the cluster, you must disable NFD deployment when installing Operator. Similarly, if NPU-Feature-Discovery has already been deployed to the cluster, you also need to disable NPU-Feature-Discovery deployment when installing Operator.

values.yaml

```yaml
nfd:
  enabled: false
npu-feature-discovery:
  enabled: false
```

Check NFD labels on nodes to determine if NFD is already running in the cluster.

```sh
kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
```

If the command output is true, NFD is already running in the cluster. In this case, set nodefeaturerules to install NPU custom node discovery rules.

```yaml
nfd:
  nodefeaturerules: true
```

By default, nfd is true, npu-feature-discovery is true, and nodefeaturerules is false.
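The decision described in this note can be scripted. A minimal sketch, assuming the NFD label check has already been reduced to a single `true` or `false`: the function name `nfd_helm_flags` is hypothetical, the `[...] | any` wrapper in the usage comment is an adaptation that collapses the per-node results into one value, and the `nfd.enabled` and `nfd.nodefeaturerules` keys are the ones named in this section.

```shell
#!/bin/sh
# Hypothetical helper: maps "is NFD already running?" (true/false)
# to the Helm --set flags described in the note.
nfd_helm_flags() {
  case "$1" in
    true)
      # NFD exists: skip its deployment, but install the NPU discovery rules.
      echo "--set nfd.enabled=false --set nfd.nodefeaturerules=true"
      ;;
    *)
      # NFD absent: keep the chart defaults.
      echo ""
      ;;
  esac
}

# Usage against a live cluster (not executed here):
#   running=$(kubectl get nodes -o json | jq '[.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))] | any')
#   helm install --wait --generate-name openfuyao/npu-operator $(nfd_helm_flags "$running")
```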
Installation Steps
NPU Operator extension component can be downloaded and installed from the openFuyao application market.
Enter the openFuyao platform and select "Application Market > Application List" from the left navigation bar.
Search for "npu-operator" in the application list to find the NPU Operator extension component.
Click the NPU Operator card to enter the application details page.
On the details page, click "Deploy" in the upper right corner, and enter "Application Name", "Version Information", and "Namespace" in the "Installation Information" module on the deployment interface.
Click "Confirm" to successfully deploy the component.
Note:
Currently, the online installation function supports driver firmware installation for 910B and 310P series models. Online installation of 910C model driver firmware is not currently supported. If you need to install 910C model driver firmware, please refer to the Offline Installation section. When installing and deploying the npu-operator component through the application market, you can modify the corresponding values.yaml parameters. For details, please refer to Table 2.
Offline Installation
Prerequisites
- Please refer to Online Installation Prerequisites.
- Download offline images: Download all images used by the components to be installed locally.
- Prepare driver firmware zip package and MindIO component zip package:
  - Download driver firmware zip package: Go to npu-driver-installer's repository, find the config.json file for the corresponding driver firmware version, and click the corresponding link according to the NPU model and OS architecture of the corresponding node to download the corresponding driver firmware zip package; 910C model NPU supports offline driver firmware installation, which needs to be downloaded separately from Ascend Community - Firmware and Driver Download. Driver firmware zip packages for other models can also be downloaded from the above link.
  - Download MindIO component zip package: Go to npu-node-provision's repository, find the config.json file for the corresponding component version, and click the corresponding link according to the NPU model and OS architecture of the corresponding node to download the corresponding SDK zip package.
- Place the driver firmware zip file in the node path for offline installation: /tmp/driver_pkg/. This path can be customized. For specific modification methods, please refer to the driver.env field in Table 2.
- Place the MindIO component zip package in the node path for offline installation: /opt/openFuyao/mindio/
- Check if the following tools exist in the nodes to be installed:
  - If the package management tool is yum, the packages to be installed are: "jq wget unzip which initscripts coreutils findutils gawk e2fsprogs util-linux net-tools pciutils gcc make automake autoconf libtool git patch kernel-devel-$(uname -r) dkms"
  - If the package management tool is apt-get, the packages to be installed are: "jq wget unzip debianutils coreutils findutils gawk e2fsprogs util-linux net-tools pciutils gcc make automake autoconf libtool git patch dkms linux-headers-$(uname -r)"
  - If the package management tool is dnf, the packages to be installed are: "jq wget unzip which initscripts coreutils findutils gawk e2fsprogs util-linux net-tools pciutils gcc make automake autoconf libtool git patch kernel-devel-$(uname -r) dkms"
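A quick way to spot gaps before an offline installation is to probe for the commands that these packages provide. This is a rough sketch: `missing_commands` and `detect_pkg_manager` are hypothetical helpers, the order in which package managers are probed is an assumption, and the command names in the usage comment are an assumed subset (packages such as coreutils or the kernel headers do not map to a single command and still need a package-manager query).

```shell
#!/bin/sh
# Hypothetical helper: prints every given command name that is not on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Hypothetical helper: reports the first package manager found, or "none".
detect_pkg_manager() {
  for pm in dnf yum apt-get; do
    if command -v "$pm" >/dev/null 2>&1; then
      echo "$pm"
      return 0
    fi
  done
  echo "none"
}

# Usage (not executed here):
#   missing_commands jq wget unzip gawk gcc make automake autoconf git patch dkms
#   detect_pkg_manager
```

An empty result from `missing_commands` means every probed command was found; anything it prints can then be installed with the package manager that `detect_pkg_manager` reports.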
Installation Steps
Please refer to Online Installation Steps.
Standalone Deployment
Online Installation
Prerequisites
Please refer to openFuyao Platform Deployment Prerequisites.
Installation Steps
Add the openFuyao Helm repository.
```shell
helm repo add openfuyao https://helm.openfuyao.cn && helm repo update
```

Install NPU Operator.
Install Operator using default configuration:

```shell
helm install --wait --generate-name \
  -n default --create-namespace \
  openfuyao/npu-operator
```
If installing via Helm application store or application market, add the https://helm.openfuyao.cn repository and install through the interface. For details, please refer to Common Customization Options.
Note:
- After installing NPU Operator, labels related to NPU resources will be added to nodes according to different node environments. These labels are related to cluster scheduling components. The accelerate-type label requires the node's hardware server to fully match the NPU card. For specific matching relationships, please refer to MindCluster Documentation's Creating Node Labels.
- For A800I A2 inference servers, automatic addition of server-usage=infer label is not currently supported. Users need to manually add it using the following command.
```
kubectl label nodes <node-name> server-usage=infer
```

Common Customization Options
When using Helm Chart, the following options can be modified. These options are used during Helm installation through --set or --set-json (used when modifying component environment variables, volumes, and other list structures). Due to the large number of configuration items, it is recommended to directly modify the corresponding fields in the values.yaml file in the chart package.
Table 2 lists the most commonly used fields. For other fields, please refer to the repository.
| Parameter | Description | Default |
|---|---|---|
| nfd.enabled | Deploy Node Feature Discovery service NFD. If NFD is already running in the cluster, set this variable to false. | true |
| nfd.nodefeaturerules | When set to true, install NFD discovered NPU device rules CR. | true |
| node-feature-discovery.image.repository | NFD service image address. | registry.k8s.io/nfd/node-feature-discovery |
| node-feature-discovery.image.pullPolicy | NFD service image pull policy. | Always |
| node-feature-discovery.image.tag | NFD service image version. | v0.16.4 |
| npu-feature-discovery.images.core.repository | NPU-Feature-Discovery image address. | cr.openfuyao.cn/openfuyao/npu-feature-discovery |
| npu-feature-discovery.images.core.pullPolicy | NPU-Feature-Discovery image pull policy. | Always |
| npu-feature-discovery.images.core.tag | NPU-Feature-Discovery image version. | latest |
| npu-feature-discovery.enabled | Switch for deploying NPU-Feature-Discovery. If NPU-Feature-Discovery is already running in the cluster, set this variable to false. | true |
| images.operator.repository | NPU Operator image address. | cr.openfuyao.cn/openfuyao/npu-operator |
| images.operator.tag | NPU Operator image version. | latest |
| images.operator.pullPolicy | NPU Operator image pull policy. | Always |
| daemonSets.labels | Custom labels to add to all NPU Operator managed Pods. | {} |
| daemonSets.tolerations | Custom tolerations to add to all NPU Operator managed Pods. | [] |
| driver.enabled | By default, Operator deploys NPU driver firmware program as a container on the system. | true |
| images.driver.repository | Driver firmware program image storage address. | cr.openfuyao.cn/openfuyao/npu-driver-installer |
| images.driver.tag | Driver firmware installation service image version. | latest |
| driver.env | Environment variables related to the driver firmware installation service. HOST_DRIVER_SOURCE_PATH specifies the placement path for the driver firmware zip package. DRIVER_VERSION specifies the driver firmware version to download and install online. | HOST_DRIVER_SOURCE_PATH: "/tmp/driver_pkg"; DRIVER_VERSION: "25.3.RC1" |
| devicePlugin.enabled | By default, Operator deploys NPU device plugin program on the system. When using Operator on a system with pre-installed device plugins, please set this value to false. | true |
| images.devicePlugin.repository | Device plugin program image address. | hub.oepkgs.net/openfuyao/ascendhub/ascend-k8sdeviceplugin |
| images.devicePlugin.tag | Device plugin service image version. | v7.2.RC1 |
| trainer.enabled | By default, Operator installs Ascend operator. If installation is not needed, set this value to false. | true |
| images.trainer.repository | Ascend operator image address. | hub.oepkgs.net/openfuyao/ascendhub/ascend-operator |
| images.trainer.tag | Ascend operator image version. | v7.2.RC1 |
| ociRuntime.enabled | By default, Operator installs Ascend Docker Runtime. If installation is not needed, set this value to false. | true |
| images.ociRuntime.repository | Ascend Docker Runtime image address. | cr.openfuyao.cn/openfuyao/npu-container-toolkit |
| images.ociRuntime.tag | Ascend Docker Runtime image version. | latest |
| nodeD.enabled | By default, Operator installs nodeD. If installation is not needed, set this value to false. | true |
| images.nodeD.repository | nodeD image address. | hub.oepkgs.net/openfuyao/ascendhub/noded |
| images.nodeD.tag | nodeD image version. | v7.2.RC1 |
| clusterd.enabled | By default, Operator installs clusterD component. If installation is not needed, set this value to false. | true |
| images.clusterd.repository | clusterD image address. | hub.oepkgs.net/openfuyao/ascendhub/clusterd |
| images.clusterd.tag | clusterD image version. | v7.2.RC1 |
| rscontroller.enabled | By default, Operator installs resilience controller component. If installation is not needed, set this value to false. | true |
| images.rscontroller.repository | resilience controller image address. | hub.oepkgs.net/openfuyao/ascendhub/resilience-controller |
| images.rscontroller.tag | resilience controller image version. | v7.1.RC1 |
| exporter.enabled | By default, Operator installs NPU Exporter component. If installation is not needed, set this value to false. | true |
| images.exporter.repository | NPU Exporter image address. | cr.openfuyao.cn/openfuyao/npu-exporter |
| images.exporter.tag | NPU Exporter image version. | v7.2.RC1-of.1 |
| mindiotft.enabled | By default, Operator installs MindIO Training Fault Tolerance. If installation is not needed, set this value to false. | true |
| images.mindiotft.repository | MindIO Training Fault Tolerance image address. | cr.openfuyao.cn/openfuyao/npu-node-provision |
| images.mindiotft.tag | MindIO Training Fault Tolerance image version. | latest |
| mindioacp.enabled | By default, Operator installs MindIO Async Checkpoint Persistence. If installation is not needed, set this value to false. | true |
| images.mindioacp.repository | MindIO Async Checkpoint Persistence service image address. | cr.openfuyao.cn/openfuyao/npu-node-provision |
| images.mindioacp.tag | MindIO Async Checkpoint Persistence service image version. | latest |
| mindioacp.version | MindIO Async Checkpoint Persistence version. | 7.2.RC1 |
| vccontroller.enabled | By default, Operator installs volcano-controller. If installation is not needed, set this value to false. | true |
| images.vccontroller.repository | volcano-controller service image address. | hub.oepkgs.net/openfuyao/ascendhub/vc-controller-manager |
| images.vccontroller.tag | volcano-controller service image version. | v1.9.0-v7.2.RC1 |
| vcscheduler.enabled | By default, Operator installs volcano-scheduler. If installation is not needed, set this value to false. | true |
| images.vcscheduler.repository | volcano-scheduler service image address. | hub.oepkgs.net/openfuyao/ascendhub/vc-scheduler |
| images.vcscheduler.tag | volcano-scheduler service image version. | v1.9.0-v7.2.RC1 |
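Because several of these fields are lists or nested maps, an override file is often easier to maintain than a long chain of --set flags. The sketch below writes one and shows how it might be passed to Helm. The file name `custom-values.yaml`, the custom path `/data/driver_pkg`, and the choice of which fields to override are illustrative assumptions; the keys themselves come from Table 2.

```shell
#!/bin/sh
# Write an override file (hypothetical name) using keys from Table 2.
cat > custom-values.yaml <<'EOF'
nfd:
  enabled: false            # NFD is already running in the cluster
  nodefeaturerules: true    # still install the NPU discovery rules
driver:
  env:
    - name: HOST_DRIVER_SOURCE_PATH
      value: "/data/driver_pkg"   # assumed custom path for the driver zip
    - name: DRIVER_VERSION
      value: "25.3.RC1"
exporter:
  enabled: false            # example: skip NPU Exporter
EOF

# Install with the overrides (not executed here):
#   helm install --wait --generate-name -n default --create-namespace \
#     -f custom-values.yaml openfuyao/npu-operator
```

Helm replaces list values wholesale rather than merging them, which is why both `driver.env` entries are repeated even though only the path changes.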
Offline Installation
Prerequisites
Please refer to openFuyao Platform Deployment Prerequisites.
Installation Steps
Prepare the NPU Operator chart package in advance.
Install Operator using default configuration:
```shell
cd npu-operator/charts
helm install <npu-operator release name> npu-operator
```

For common customization options, see Table 2.
Upgrade
NPU Operator supports dynamic updates to existing resources. This feature enables NPU Operator to ensure that NPU Policy settings in the cluster are always up to date.
Since Helm does not support automatic upgrades of existing CRDs, you can upgrade the NPU Operator Chart either manually or by enabling Helm Hooks.
NPU Policy CR Update
NPU Operator supports dynamic updates to npuclusterpolicy CustomResource using kubectl.
```
kubectl get npuclusterpolicy -A
# If the default npuclusterpolicy has not been modified, the default name of npuclusterpolicy is cluster.
kubectl edit npuclusterpolicy cluster
```

After editing, Kubernetes automatically applies the updates to the cluster. All components managed by NPU Operator will also be updated to the expected state.
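For scripted changes, a non-interactive patch can stand in for kubectl edit. A minimal sketch, assuming the policy is named cluster and that each component exposes a `managed` flag under `spec`, as in the CustomResource shown in the Installation Status Verification section; the helper name `managed_patch` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: builds a JSON merge patch that sets spec.<component>.managed.
managed_patch() {
  component=$1
  managed=$2   # true or false
  printf '{"spec":{"%s":{"managed":%s}}}\n' "$component" "$managed"
}

# Usage against a live cluster (not executed here):
#   kubectl patch npuclusterpolicy cluster --type merge -p "$(managed_patch exporter false)"
```

A merge patch is used because strategic merge patches are not available for custom resources.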
Installation Status Verification
View Component Status via CustomResource
View component status through CustomResource npuclusterpolicies.npu.openfuyao.com. The specific method is to check the state field of each component in the status field to confirm the current status of the component. Below is an example of the driver installer running normally.
```yaml
status:
  componentStatuses:
  - name: /var/lib/npu-operator/components/driver
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
```

- View CustomResource
```
$ kubectl get npuclusterpolicies.npu.openfuyao.com cluster -o yaml
apiVersion: npu.openfuyao.com/v1
kind: NPUClusterPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: npu
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2025-03-11T13:22:39Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: npu-operator
  name: cluster
  resourceVersion: "2240086"
  uid: 0d1498c5-143a-4e05-a5dc-376d2e6c96ea
spec:
  clusterd:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/clusterd
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/clusterd/clusterd.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  daemonsets:
    imageSpec:
      imagePullPolicy: IfNotPresent
      imagePullSecrets: []
    labels:
      app.kubernetes.io/managed-by: npu-operator
      helm.sh/chart: npu-operator-0.0.0-latest
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Equal
      value: ""
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
      operator: Equal
      value: ""
  devicePlugin:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/ascend-k8sdeviceplugin
      tag: v6.0.0
    initImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: hub.oepkgs.net/busybox:latest
      tag: ""
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/devicePlugin/devicePlugin.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  driver:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-driver-installer
      tag: latest
    initImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: cr.openfuyao.cn/openfuyao/npu-driver-installer:latest
      tag: ""
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/driver/driver.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
    version: 24.1.RC3
  exporter:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/npu-exporter
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/npu-exporter/npu-exporter.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  mindioacp:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-node-provision
      tag: latest
    managed: false
    version: 6.0.0
  mindiotft:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-node-provision
      tag: latest
    managed: false
  nodeD:
    heartbeatInterval: 5
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/noded
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/noded/noded.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
    pollInterval: 60
  ociRuntime:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-container-toolkit
      tag: latest
    initConfigImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: cr.openfuyao.cn/openfuyao/npu-container-toolkit:latest
      tag: ""
    initRuntimeImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: cr.openfuyao.cn/openfuyao/ascend-image/ascend-docker-runtime:latest
      tag: ""
    interval: 300
    managed: true
  operator:
    imageSpec:
      imagePullPolicy: IfNotPresent
      imagePullSecrets: []
    runtimeClass: ascend
  rscontroller:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/resilience-controller
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/resilience-controller/run.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  trainer:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/ascend-operator
      tag: v6.0.0
    initImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: hub.oepkgs.net/library/busybox:latest
      tag: ""
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/ascend-operator/ascend-operator.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  vccontroller:
    controllerResources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 1000m
        memory: 1Gi
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/vc-controller-manager
      tag: v1.9.0-v6.0.0
    managed: true
  vcscheduler:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/vc-scheduler
      tag: v1.9.0-v6.0.0
    managed: true
    schedulerResources:
      limits:
        cpu: 200m
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 1Gi
status:
  componentStatuses:
  - name: /var/lib/npu-operator/components/driver
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/oci-runtime
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/device-plugin
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/trainer
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/noded
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/volcano/volcano-controller
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/volcano/volcano-scheduler
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/clusterd
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/resilience-controller
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/npu-exporter
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/mindio/mindiotft
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: ComponentUnmanaged
      type: unmanaged
  - name: /var/lib/npu-operator/components/mindio/mindioacp
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: ComponentUnmanaged
      type: unmanaged
  conditions:
  - lastTransitionTime: "2025-03-11T13:25:41Z"
    message: ""
    reason: Ready
    status: "False"
    type: Error
  - lastTransitionTime: "2025-03-11T13:25:41Z"
    message: all components have been successfully reconciled
    reason: Reconciled
    status: "True"
    type: Ready
  namespace: default
  phase: Ready
```

Manually Verify Installation Status and Running Results of Each Component
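As a first manual check, the component states recorded in the CustomResource can be filtered from the command line instead of being read in full. The sketch below reduces name and state pairs to the entries that need attention. The helper name `components_needing_attention` is hypothetical, and treating `unmanaged` as acceptable is an assumption based on the example CustomResource output in this document.

```shell
#!/bin/sh
# Hypothetical helper: reads "name<TAB>state.type" lines and prints the
# components whose state is neither "running" nor "unmanaged".
components_needing_attention() {
  awk -F '\t' '$2 != "running" && $2 != "unmanaged" { print $1 " (" $2 ")" }'
}

# Usage against a live cluster (not executed here):
#   kubectl get npuclusterpolicies.npu.openfuyao.com cluster \
#     -o jsonpath='{range .status.componentStatuses[*]}{.name}{"\t"}{.state.type}{"\n"}{end}' \
#     | components_needing_attention
```

No output means every managed component reports `running`.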
Driver Installation Status Verification
To check the driver firmware installation status, use a command like npu-smi info. If the output is similar to the following, it indicates the driver has been installed.

```shell
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 99.1        55                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3162 / 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 91.7        53                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3162 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 98.2        51                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3162 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 93.2        49                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3162 / 65536         |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 98.8        55                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          3163 / 65536         |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 96.2        56                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          3163 / 65536         |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 96.9        53                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3162 / 65536         |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 97.6        55                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          3163 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
```

MindCluster Component Installation Status Verification
Use `kubectl get pod -A` to view all Pods. If all Pods are in the Running status, the components have started successfully. For more detailed verification of each component's function status, refer to the MindCluster Official Documentation.
```shell
NAMESPACE        NAME                                       READY   STATUS    RESTARTS   AGE
default          ascend-runtime-containerd-7lg85            1/1     Running   0          6m31s
default          npu-driver-c4744                           1/1     Running   0          6m31s
default          npu-operator-77f56c9f6c-fhx8m              1/1     Running   0          6m32s
default          npu-feature-discovery-zqgt9                1/1     Running   0          7m12s
default          mindio-acp-43f64g63d2v                     1/1     Running   0          7m21s
default          mindio-tft-2cc35gs3c2u                     1/1     Running   0          6m32s
kube-system      ascend-device-plugin-fm4h9                 1/1     Running   0          6m35s
mindx-dl         ascend-operator-manager-6ff7468bd9-47d7s   1/1     Running   0          6m50s
mindx-dl         clusterd-5ffb8f6787-n5m82                  1/1     Running   0          6m48s
mindx-dl         noded-kmv8d                                1/1     Running   0          7m11s
mindx-dl         resilience-controller-6727f36c28-wjn3s     1/1     Running   0          7m20s
npu-exporter     npu-exporter-b6txl                         1/1     Running   0          7m22s
volcano-system   volcano-controllers-373749bg23c-mc9cq      1/1     Running   0          7m31s
volcano-system   volcano-scheduler-d585db88f-nkxch          1/1     Running   0          7m40s
```

NOTE
The ascend-docker-runtime is installed as a plugin and registered with containerd. To use this feature, specify the runtime as ascend-docker-runtime when starting a container, or specify the runtimeClassName as ascend when creating Kubernetes resources. Example:
```shell
ctr run --runtime io.containerd.runc.v2 --runc-binary /var/lib/npu-container-toolkit/runtime/ascend-docker-runtime -t \
  --env ASCEND_VISIBLE_DEVICES=0 ubuntu:22.04 <container_id>
```
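For the Kubernetes path mentioned in the note, a minimal sketch of a Pod that sets runtimeClassName to ascend might look like the following. The Pod name, container name, image, command, and manifest file name are placeholders chosen for illustration, and NPU resource requests are omitted. Only the runtimeClassName value comes from the note above.

```shell
# Write a minimal Pod manifest that selects the "ascend" runtime class.
# Pod name, container name, image, and command are illustrative placeholders.
cat > ascend-demo-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ascend-demo
spec:
  runtimeClassName: ascend   # value taken from the note above
  restartPolicy: Never
  containers:
    - name: demo
      image: ubuntu:22.04
      command: ["sleep", "3600"]
EOF

# On a cluster where NPU Operator is installed, the manifest could then be applied with:
#   kubectl apply -f ascend-demo-pod.yaml
```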
Uninstallation
Execute the following steps to uninstall Operator.
Delete the Operator through the Helm CLI, as shown below, or through the application management interface.

```shell
helm delete <npu-operator release name>
```

By default, Helm does not delete existing CRDs when a chart is deleted. Execute the following command to check whether the CRD still exists.

```shell
kubectl get crd npuclusterpolicies.npu.openfuyao.com
```

Execute the following command to manually delete the CRD.

```shell
kubectl delete crd npuclusterpolicies.npu.openfuyao.com
```

Note:
After Operator uninstallation, the driver may still exist on the host machine.
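The uninstallation steps above can be collected into one shell function. This is a sketch, not part of NPU Operator: the function name `uninstall_npu_operator` is hypothetical, and the release name must be supplied by the caller. The CRD name and the helm and kubectl commands are the ones shown in the steps.

```shell
# CRD name as given in the uninstallation steps.
NPU_CRD="npuclusterpolicies.npu.openfuyao.com"

# uninstall_npu_operator RELEASE  (hypothetical helper)
# Deletes the Helm release, then removes the CRD if it is still present.
uninstall_npu_operator() {
  release="$1"
  if [ -z "$release" ]; then
    echo "usage: uninstall_npu_operator <npu-operator release name>" >&2
    return 1
  fi

  helm delete "$release" || return 1

  # Helm leaves CRDs in place, so check for the CRD and delete it manually.
  if kubectl get crd "$NPU_CRD" >/dev/null 2>&1; then
    kubectl delete crd "$NPU_CRD"
  fi
}

# Example call (the release name is a placeholder):
#   uninstall_npu_operator <npu-operator release name>
```

The function does not touch the driver on the host, which matches the note above.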
Component Installation and Uninstallation Instructions
Component Installation and Uninstallation Fields
When NPU Operator is installed for the first time, if a component's `enabled` field in values.yaml is set to `true`, the component resources managed by NPU Operator replace any existing resources for that component, regardless of whether they were already present in the cluster.

If a component already exists in the cluster (e.g., volcano-controller) and its `enabled` field in values.yaml is set to `false` during the first installation of NPU Operator, the existing component resources in the cluster are not deleted.

After NPU Operator is installed, you can modify the corresponding fields of the CR instance to change a component's image address and resource configuration and to manage its lifecycle.
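As an illustration of the `enabled` field, the sketch below writes a values override that leaves an existing ascend-operator deployment in place on first installation. It assumes that `enabled` sits directly under the component key and uses the `trainer` key, which the values.yaml field instructions in this document use for ascend-operator. The file name is arbitrary, and the actual key layout should be checked against the chart's values.yaml.

```shell
# Write a values override file; the key placement is an assumption (see the text above).
cat > keep-existing-values.yaml <<'EOF'
trainer:
  enabled: false   # existing ascend-operator resources in the cluster are left in place
EOF

# Passed to Helm at first installation, for example:
#   helm install <npu-operator release name> <chart> -f keep-existing-values.yaml
```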
MindIO Installation Dependencies
To install MindIO-related components, a Python environment (including the pip3 tool) must be installed on the node in advance. Supported Python versions are 3.7 to 3.11; with any other version, installation cannot complete normally. To use a component, mount the corresponding installed SDK into the training container.
The MindIO TFT (Training Fault Tolerance) component is installed in /opt/sdk/tft, and whl packages for different Python versions are provided in /opt/tft-whl-package to meet users' customized needs. For specific usage, refer to Fault Recovery Acceleration.
The MindIO ACP (Async Checkpoint Persistence) component is installed in /opt/mindio and /opt/sdk/acp, and whl packages for different Python versions are provided in /opt/acp-whl-package for users to install as needed. For specific usage instructions, refer to Checkpoint Save and Load Optimization.
When MindIO components are uninstalled, the SDK folders related to them are cleared, which may cause exceptions in task containers that use the service. Proceed with caution.
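Before enabling the MindIO components, the node's Python prerequisite can be checked with a small script. The following is a sketch based on the 3.7 to 3.11 range stated above. The function names are hypothetical, and it assumes `python3` and `pip3` are the command names on the node.

```shell
# python_version_supported VERSION  (hypothetical helper)
# Succeeds if VERSION ("major.minor" or "major.minor.patch") is within 3.7-3.11.
python_version_supported() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  [ "$major" = "3" ] && [ "$minor" -ge 7 ] 2>/dev/null && [ "$minor" -le 11 ] 2>/dev/null
}

# check_node_python  (hypothetical helper)
# Checks the node's python3 version and the presence of pip3.
check_node_python() {
  command -v python3 >/dev/null 2>&1 || { echo "python3 not found" >&2; return 1; }
  command -v pip3 >/dev/null 2>&1 || { echo "pip3 not found" >&2; return 1; }
  version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
  python_version_supported "$version" || { echo "unsupported Python $version" >&2; return 1; }
  echo "Python $version and pip3 are available"
}
```

Running `check_node_python` on each node that will host MindIO components gives an early signal before the installation is attempted.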
Helm chart values.yaml Special Field Instructions
- The `driver.env` field adds container environment variables for npu-driver-installer. The environment variable "HOST_DRIVER_SOURCE_PATH" is the path where the zip package must be placed for offline installation of the driver and firmware; the current default path is "/tmp/driver_pkg". The environment variable "DRIVER_VERSION" is the driver and firmware version number; the default value is "25.3.RC1".
- The `trainer.commandSpec` field: taking the ascend-operator component as an example, the `commandSpec` field provides the container startup command configuration for the component. It can be modified to set, for example, the log level, log path, and component startup parameters. For details, refer to the parameter descriptions of each component in MindCluster Documentation. ascend-device-plugin, npu-exporter, volcano, clusterd, noded, and other components all contain this field, and different startup parameters can be configured for each.
- The `trainer.resources` field: taking the ascend-operator component as an example, the `resources` field provides the resource request configuration for the component container. Unless there is a special need, keep the default configuration; it can be adjusted according to specific business and cluster resource conditions. ascend-device-plugin, npu-exporter, volcano, clusterd, noded, and other components all contain this field.
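To make the `driver.env` description concrete, the sketch below writes an override file that sets the two environment variables to their documented defaults. It assumes `driver.env` follows the Kubernetes container env convention of name/value entries, which this section does not confirm, and the file name is arbitrary. The `commandSpec` and `resources` fields are left out because their exact structure is not described here.

```shell
# Write a values override for the npu-driver-installer environment variables.
# The name/value list shape is an assumption; check the chart's values.yaml.
cat > driver-values.yaml <<'EOF'
driver:
  env:
    - name: HOST_DRIVER_SOURCE_PATH
      value: "/tmp/driver_pkg"   # directory holding the offline driver/firmware zip
    - name: DRIVER_VERSION
      value: "25.3.RC1"          # driver/firmware version number
EOF

# Passed to Helm with -f, for example:
#   helm upgrade <npu-operator release name> <chart> -f driver-values.yaml
```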
