Version: v26.03

NPU Operator

Feature Introduction

Kubernetes provides access to special hardware resources (such as Ascend NPUs) through the Device Plugin mechanism. However, configuring and managing nodes that carry this hardware requires several software components (such as drivers, container runtimes, and other libraries), which are complex and error-prone to install. NPU Operator uses the Operator Framework in Kubernetes to automatically manage all software components required for configuring Ascend devices. These components include the Ascend driver and firmware and the MindCluster components, such as the device plugin, that support cluster job scheduling, operations monitoring, and fault recovery. With these components installed, the cluster gains NPU resource management, optimized workload scheduling, and container support for training and inference tasks, so that AI jobs can be deployed and run as containers on NPU devices.

Table 1 Currently Supported Components

| Component Name | Deployment Method | Component Function |
| --- | --- | --- |
| Ascend Driver and Firmware | Containerized deployment managed by NPU Operator | Acts as a bridge between hardware devices and the operating system, allowing the operating system to recognize and communicate with hardware devices. |
| Ascend Device Plugin | Containerized deployment managed by NPU Operator | Device Discovery: Based on the Kubernetes device plugin mechanism, adds device discovery, device allocation, and device health status reporting functions for Ascend AI processors, enabling Kubernetes to manage Ascend AI processor resources. |
| Ascend Operator | Containerized deployment managed by NPU Operator | Environment Configuration: Volcano coordination component, responsible for managing acjob type tasks, injecting environment variables required by AI frameworks (MindSpore/PyTorch/TensorFlow) training tasks into containers, then Volcano takes over scheduling. |
| Ascend Docker Runtime | Containerized deployment managed by NPU Operator | Ascend Container Runtime: Container engine plugin, provides NPU containerization support for all AI jobs, enabling users to smoothly run AI jobs as Docker containers on Ascend devices. |
| NPU Exporter | Containerized deployment managed by NPU Operator | Real-time monitoring of Ascend AI processor resource data: This function supports real-time collection of various resource data from Ascend AI processors, including processor utilization, temperature, voltage, and memory usage. Additionally, it can monitor vNPU of Atlas inference series products, including key metrics such as AI Core utilization, vNPU total memory, and used memory. |
| Resilience Controller | Containerized deployment managed by NPU Operator | Dynamic Scaling: When a fault occurs during task training and there are insufficient healthy resources for replacement, this component can use dynamic scaling to remove faulty resources and continue training. After sufficient resources become available, training tasks can be resumed through dynamic scaling. |
| ClusterD | Containerized deployment managed by NPU Operator | Collects cluster task information, resource information, and fault information, uniformly determines fault handling levels and strategies, and controls process recomputation of training containers. |
| Volcano | Containerized deployment managed by NPU Operator | Obtains cluster resource information from underlying components, selects optimal scheduling strategies and resource allocation by sensing the network connection methods between Ascend chips, and can perform task rescheduling when task resource faults occur. |
| NodeD | Containerized deployment managed by NPU Operator | Detects node resource monitoring status and node fault information, reports fault information, and prevents new tasks from being scheduled on faulty nodes. |
| MindIO | Containerized deployment managed by NPU Operator | Handles the generation and saving of terminal CheckPoint after model training interruption, online repair of UCE faults in on-chip memory during model training, provides the ability to complete fault repair and model resumption training through restart or node replacement, and optimizes CheckPoint saving and loading. |

For detailed information about components, please refer to MindCluster Introduction.

Component Version Compatibility

| Component Name | Version |
| --- | --- |
| Ascend Driver and Firmware | 25.3.RC1 |
| Ascend Device Plugin | 7.2.RC1 |
| Ascend Operator | 7.2.RC1 |
| Ascend Docker Runtime | 7.2.RC1 |
| NPU Exporter | 7.2.RC1 |
| Resilience Controller | 7.2.RC1 |
| ClusterD | 7.2.RC1 |
| Volcano | 7.2.RC1 (based on original Volcano 1.9.0) |
| NodeD | 7.2.RC1 |
| MindIO | 7.2.RC1 |

Application Scenarios

NPU Operator is intended for clusters built on Ascend devices that need cluster job scheduling, operations monitoring, and fault recovery. It automatically identifies Ascend nodes in the cluster and performs the corresponding installation and deployment work. For training scenarios, it supports NPU resource detection, full card scheduling, static vNPU scheduling, resumable training, and elastic training. For inference scenarios, it supports resource detection, full card scheduling, static vNPU scheduling, dynamic vNPU scheduling, inference card fault recovery, and rescheduling.

Capability Scope

  • Automatically discovers Ascend NPU device nodes and labels them.
  • Automatically deploys the Ascend NPU driver and firmware.
  • Automates deployment, installation, and lifecycle management of the MindCluster cluster scheduling components.

Highlight Features

NPU Operator automatically identifies Ascend nodes and device models in the cluster and installs the matching versions of the components required by the AI runtime, greatly lowering the barrier to configuring Ascend ecosystem components. It provides full lifecycle management and automated configuration and deployment for installed components, detects component installation status, and provides detailed logs for debugging.

Implementation Principle

  • The Operator watches for changes to the CR instantiated from its CRD and updates the state of the managed components accordingly.
  • The Operator relies on the labels that NFD applies to nodes, and uses the npu-feature-discovery component to add the node labels that Ascend components need for scheduling.

npu-operator

Ensure that application-management-service and marketplace-service are running normally so that this feature can be installed from the application market.

Operator Security Context

Some Pods managed by NPU Operator (such as driver containers) require the following elevated privileges:

  • privileged: true
  • hostPID: true
  • hostIPC: true
  • hostNetwork: true

Reasons for elevated privileges:

  • Access host file system and hardware devices, install driver firmware and SDK services on the host machine.
  • Modify device permissions to adapt for non-root users.

Installation

openFuyao Platform Deployment

Online Installation

Prerequisites
  • kubectl and Helm CLI are available on the current computer, or there is a configurable application store or repository in the cluster.

  • Confirm that bash is available in the environment; otherwise, the driver firmware installation script may fail to parse.

  • All worker nodes or node groups that run NPU workloads in the Kubernetes cluster must run openEuler 22.03 LTS or Ubuntu 22.04 (ARM architecture).

    Worker nodes or node groups that run only CPU workloads can use any operating system, because NPU Operator does not configure or manage nodes without NPU workloads.

    The components installed by the current NPU Operator require NPU chip model 910B or 310P. For specific OS and hardware compatibility, please refer to MindCluster Documentation.

  • Node Feature Discovery (NFD) and NPU Feature Discovery (NPU-Feature-Discovery) are dependencies of Operator on each node.

    Note:
    By default, the Operator automatically deploys the NFD master and worker. If NFD is already running in the cluster, you must disable NFD deployment when installing the Operator. Similarly, if NPU-Feature-Discovery has already been deployed to the cluster, you must also disable its deployment when installing the Operator.

    values.yaml

    yaml
    nfd:
      enabled: false
    npu-feature-discovery:
      enabled: false

    Check NFD labels on nodes to determine if NFD is already running in the cluster.

    sh
    kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

    If the command output is true, NFD is already running in the cluster. In this case, set nodefeaturerules to install NPU custom node discovery rules.

    yaml
    nfd:
      nodefeaturerules: true

    By default, nfd.enabled is true, npu-feature-discovery.enabled is true, and nfd.nodefeaturerules is false.
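    The check and the resulting settings can be combined in a small helper. The following is a minimal sketch, not part of the NPU Operator tooling: the function name nfd_helm_flags is hypothetical, and it assumes the jq check prints one boolean per node, so NFD is treated as present when any line is true.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the per-node booleans printed by the jq check
# (one "true" or "false" per line) on stdin and prints matching Helm flags.
nfd_helm_flags() {
  if grep -qx 'true'; then
    # NFD labels found: skip the bundled NFD, but install the NPU discovery rules.
    echo "--set nfd.enabled=false --set nfd.nodefeaturerules=true"
  else
    # No NFD labels found: keep the chart defaults (NFD deployed by the Operator).
    echo ""
  fi
}

# Usage (requires kubectl and jq; not executed here):
#   kubectl get nodes -o json \
#     | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))' \
#     | nfd_helm_flags
```

    The printed flags can be appended to the helm install command. The same values can instead be written to values.yaml as shown above.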

Installation Steps

The NPU Operator extension component can be downloaded and installed from the openFuyao application market.

  1. Enter the openFuyao platform and select "Application Market > Application List" from the left navigation bar.

  2. Search for "npu-operator" in the application list to find the NPU Operator extension component.

  3. Click the NPU Operator card to enter the application details page.

  4. On the details page, click "Deploy" in the upper right corner, and enter "Application Name", "Version Information", and "Namespace" in the "Installation Information" module on the deployment interface.

  5. Click "Confirm" to successfully deploy the component.

    Note:
    Online installation currently supports driver firmware installation for the 910B and 310P series only. It does not support the 910C model; to install 910C driver firmware, see the Offline Installation section. When deploying the npu-operator component through the application market, you can modify the corresponding values.yaml parameters. For details, see Table 2.

Offline Installation

Prerequisites
  • Please refer to Online Installation Prerequisites.
  • Download offline images: Download all images used by the components to be installed locally.
  • Prepare driver firmware zip package and MindIO component zip package:
    • Download the driver firmware zip package: Go to the npu-driver-installer repository, find the config.json file for the required driver firmware version, and download the zip package from the link that matches the node's NPU model and OS architecture. The 910C model supports offline driver firmware installation, but its package must be downloaded separately from Ascend Community - Firmware and Driver Download. Driver firmware zip packages for other models can also be downloaded from that page.
    • Download the MindIO component zip package: Go to the npu-node-provision repository, find the config.json file for the required component version, and download the SDK zip package from the link that matches the node's NPU model and OS architecture.
  • Place the driver firmware zip package on each node to be installed offline, in /tmp/driver_pkg/. This path can be customized through the driver.env field described in Table 2.
  • Place the MindIO component zip package on each node to be installed offline, in /opt/openFuyao/mindio/.
  • Check that the following packages are installed on the nodes to be installed:
    • If the package management tool is yum, the packages to be installed are: "jq wget unzip which initscripts coreutils findutils gawk e2fsprogs util-linux net-tools pciutils gcc make automake autoconf libtool git patch kernel-devel-$(uname -r) dkms"
    • If the package management tool is apt-get, the packages to be installed are: "jq wget unzip debianutils coreutils findutils gawk e2fsprogs util-linux net-tools pciutils gcc make automake autoconf libtool git patch dkms linux-headers-$(uname -r)"
    • If the package management tool is dnf, the packages to be installed are: "jq wget unzip which initscripts coreutils findutils gawk e2fsprogs util-linux net-tools pciutils gcc make automake autoconf libtool git patch kernel-devel-$(uname -r) dkms"
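
The package lists above can be captured in a script so that the same check runs on every node. This is a sketch under assumptions: the function names required_packages and detect_package_manager are hypothetical, and the install command in the usage comment assumes root privileges and a reachable package repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the package list from the prerequisites
# for a given package manager (yum, dnf, or apt-get).
required_packages() {
  local common="jq wget unzip coreutils findutils gawk e2fsprogs util-linux net-tools pciutils gcc make automake autoconf libtool git patch dkms"
  case "$1" in
    yum|dnf) echo "$common which initscripts kernel-devel-$(uname -r)" ;;
    apt-get) echo "$common debianutils linux-headers-$(uname -r)" ;;
    *) echo "unsupported package manager: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: prints the first supported package manager found on PATH.
detect_package_manager() {
  local pm
  for pm in dnf yum apt-get; do
    if command -v "$pm" >/dev/null 2>&1; then
      echo "$pm"
      return 0
    fi
  done
  return 1
}

# Usage (run as root on each node; not executed here):
#   pm="$(detect_package_manager)" && "$pm" install -y $(required_packages "$pm")
```

The kernel header package is resolved with $(uname -r) at run time, so the script has to run on the node whose kernel the driver will be built against.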

Installation Steps

Please refer to Online Installation Steps.

Standalone Deployment

Online Installation

Prerequisites

Please refer to openFuyao Platform Deployment Prerequisites.

Installation Steps
  1. Add the openFuyao Helm repository.

    shell
    helm repo add openfuyao https://helm.openfuyao.cn && helm repo update
  2. Install NPU Operator.

  • Install Operator using default configuration:

    shell
    helm install --wait --generate-name \
    -n default --create-namespace \
    openfuyao/npu-operator

If installing through a Helm application store or application market, add the https://helm.openfuyao.cn repository and install through the interface. For the available options, please refer to Common Customization Options.

Note:

  • After NPU Operator is installed, labels related to NPU resources are added to nodes according to each node's environment. These labels are used by the cluster scheduling components. The accelerate-type label requires the node's server hardware to exactly match the NPU card. For the specific matching relationships, please refer to Creating Node Labels in the MindCluster Documentation.
  • For A800I A2 inference servers, the server-usage=infer label is not currently added automatically. Add it manually with the following command.

    bash
    kubectl label nodes <node-name> server-usage=infer
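
When several inference servers need the label, the command can be generated once per node. The helper below is a sketch only: the function name infer_label_commands and the node names in the usage comment are placeholders. It prints the commands instead of running them, so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one kubectl label command per node name.
# Nothing is applied to the cluster until the output is piped to sh.
infer_label_commands() {
  local node
  for node in "$@"; do
    echo "kubectl label nodes $node server-usage=infer"
  done
}

# Usage (node names are placeholders; not executed here):
#   infer_label_commands node-a node-b        # review the commands
#   infer_label_commands node-a node-b | sh   # apply them
```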

Common Customization Options

When installing with the Helm chart, the following options can be set with --set, or with --set-json for list-valued settings such as component environment variables and volumes. Because there are many configuration items, editing the corresponding fields directly in the values.yaml file of the chart package is recommended.

Table 2 lists the most commonly used fields. For other fields, please refer to the repository.

Table 2 Common Options

| Parameter | Description | Default |
| --- | --- | --- |
| nfd.enabled | Deploy the Node Feature Discovery (NFD) service. If NFD is already running in the cluster, set this variable to false. Note: If set to true during installation, try not to change this field to false during use; otherwise, residual NFD labels will remain after uninstallation. | true |
| nfd.nodefeaturerules | When set to true, install the NFD rules CR for discovering NPU devices. | true |
| node-feature-discovery.image.repository | NFD service image address. | registry.k8s.io/nfd/node-feature-discovery |
| node-feature-discovery.image.pullPolicy | NFD service image pull policy. | Always |
| node-feature-discovery.image.tag | NFD service image version. | v0.16.4 |
| npu-feature-discovery.images.core.repository | NPU-Feature-Discovery image address. | cr.openfuyao.cn/openfuyao/npu-feature-discovery |
| npu-feature-discovery.images.core.pullPolicy | NPU-Feature-Discovery image pull policy. | Always |
| npu-feature-discovery.images.core.tag | NPU-Feature-Discovery image version. | latest |
| npu-feature-discovery.enabled | Switch for deploying NPU-Feature-Discovery. If NPU-Feature-Discovery is already running in the cluster, set this variable to false. | true |
| images.operator.repository | NPU Operator image address. | cr.openfuyao.cn/openfuyao/npu-operator |
| images.operator.tag | NPU Operator image version. | latest |
| images.operator.pullPolicy | NPU Operator image pull policy. | Always |
| daemonSets.labels | Custom labels to add to all NPU Operator managed Pods. | {} |
| daemonSets.tolerations | Custom tolerations to add to all NPU Operator managed Pods. | [] |
| driver.enabled | By default, Operator deploys the NPU driver firmware program as a container on the system. | true |
| images.driver.repository | Driver firmware program image storage address. | cr.openfuyao.cn/openfuyao/npu-driver-installer |
| images.driver.tag | Driver firmware installation service image version. | latest |
| driver.env | Environment variables related to the driver firmware installation service. HOST_DRIVER_SOURCE_PATH specifies the placement path for the driver firmware zip package. DRIVER_VERSION specifies the version of the driver firmware zip package to download and install online. | name: HOST_DRIVER_SOURCE_PATH, value: "/tmp/driver_pkg"; name: DRIVER_VERSION, value: "25.3.RC1" |
| devicePlugin.enabled | By default, Operator deploys the NPU device plugin program on the system. When using Operator on a system with pre-installed device plugins, set this value to false. Note: When this field is set to false, other fields set at the same time will not take effect. | true |
| images.devicePlugin.repository | Device plugin program image address. | hub.oepkgs.net/openfuyao/ascendhub/ascend-k8sdeviceplugin |
| images.devicePlugin.tag | Device plugin service image version. | v7.2.RC1 |
| trainer.enabled | By default, Operator installs Ascend operator. If installation is not needed, set this value to false. | true |
| images.trainer.repository | Ascend operator image address. | hub.oepkgs.net/openfuyao/ascendhub/ascend-operator |
| images.trainer.tag | Ascend operator image version. | v7.2.RC1 |
| ociRuntime.enabled | By default, Operator installs Ascend Docker Runtime. If installation is not needed, set this value to false. | true |
| images.ociRuntime.repository | Ascend Docker Runtime image address. | cr.openfuyao.cn/openfuyao/npu-container-toolkit |
| images.ociRuntime.tag | Ascend Docker Runtime image version. | latest |
| nodeD.enabled | By default, Operator installs nodeD. If installation is not needed, set this value to false. | true |
| images.nodeD.repository | nodeD image address. | hub.oepkgs.net/openfuyao/ascendhub/noded |
| images.nodeD.tag | nodeD image version. | v7.2.RC1 |
| clusterd.enabled | By default, Operator installs the clusterD component. If installation is not needed, set this value to false. | true |
| images.clusterd.repository | clusterD image address. | hub.oepkgs.net/openfuyao/ascendhub/clusterd |
| images.clusterd.tag | clusterD image version. | v7.2.RC1 |
| rscontroller.enabled | By default, Operator installs the resilience controller component. If installation is not needed, set this value to false. | true |
| images.rscontroller.repository | resilience controller image address. | hub.oepkgs.net/openfuyao/ascendhub/resilience-controller |
| images.rscontroller.tag | resilience controller image version. | v7.1.RC1 |
| exporter.enabled | By default, Operator installs the NPU Exporter component. If installation is not needed, set this value to false. | true |
| images.exporter.repository | NPU Exporter image address. | cr.openfuyao.cn/openfuyao/npu-exporter |
| images.exporter.tag | NPU Exporter image version. | v7.2.RC1-of.1 |
| mindiotft.enabled | By default, Operator installs MindIO Training Fault Tolerance. If installation is not needed, set this value to false. | true |
| images.mindiotft.repository | MindIO Training Fault Tolerance image address. | cr.openfuyao.cn/openfuyao/npu-node-provision |
| images.mindiotft.tag | MindIO Training Fault Tolerance image version. | latest |
| mindioacp.enabled | By default, Operator installs MindIO Async Checkpoint Persistence. If installation is not needed, set this value to false. | true |
| images.mindioacp.repository | MindIO Async Checkpoint Persistence service image address. | cr.openfuyao.cn/openfuyao/npu-node-provision |
| images.mindioacp.tag | MindIO Async Checkpoint Persistence service image version. | latest |
| mindioacp.version | MindIO Async Checkpoint Persistence version. | 7.2.RC1 |
| vccontroller.enabled | By default, Operator installs volcano-controller. If installation is not needed, set this value to false. | true |
| images.vccontroller.repository | volcano-controller service image address. | hub.oepkgs.net/openfuyao/ascendhub/vc-controller-manager |
| images.vccontroller.tag | volcano-controller service image version. | v1.9.0-v7.2.RC1 |
| vcscheduler.enabled | By default, Operator installs volcano-scheduler. If installation is not needed, set this value to false. | true |
| images.vcscheduler.repository | volcano-scheduler service image address. | hub.oepkgs.net/openfuyao/ascendhub/vc-scheduler |
| images.vcscheduler.tag | volcano-scheduler service image version. | v1.9.0-v7.2.RC1 |
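
As an illustration of combining --set and --set-json with fields from Table 2, the sketch below builds the arguments for an install that reuses an existing NFD and points the driver installer at a custom package directory. The function name npu_operator_helm_args, the release name, and the /opt/driver_pkg path in the usage comment are hypothetical. Because Helm replaces a list value as a whole, both driver.env entries are passed together.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints Helm arguments, one per line.
# $1 is the node directory holding the driver firmware zip package
# (defaults to /tmp/driver_pkg, the default listed in Table 2).
npu_operator_helm_args() {
  local driver_pkg_dir="${1:-/tmp/driver_pkg}"
  printf '%s\n' \
    "--set" "nfd.enabled=false" \
    "--set" "nfd.nodefeaturerules=true" \
    "--set-json" "driver.env=[{\"name\":\"HOST_DRIVER_SOURCE_PATH\",\"value\":\"${driver_pkg_dir}\"},{\"name\":\"DRIVER_VERSION\",\"value\":\"25.3.RC1\"}]"
}

# Usage (bash; requires helm and cluster access; not executed here):
#   mapfile -t args < <(npu_operator_helm_args /opt/driver_pkg)
#   helm install npu-operator openfuyao/npu-operator -n default "${args[@]}"
```

For more than a handful of overrides, a values.yaml file remains easier to review than a long argument list.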

Offline Installation

Prerequisites

Please refer to openFuyao Platform Deployment Prerequisites.

Installation Steps
  1. Prepare the NPU Operator chart package in advance.

  2. Install Operator using default configuration:

    shell
    cd npu-operator/charts
    helm install <npu-operator release name> npu-operator

    For common customization options, see Table 2.

Upgrade

NPU Operator supports dynamic updates to existing resources. This feature enables NPU Operator to ensure that NPU Policy settings in the cluster are always up to date.

Because Helm does not automatically upgrade existing CRDs, the NPU Operator chart can be upgraded either manually or by enabling Helm hooks.

NPU Policy CR Update

NPU Operator supports dynamic updates to the npuclusterpolicy CustomResource through kubectl.

shell
kubectl get npuclusterpolicy -A
# If the default npuclusterpolicy has not been modified, the default name of npuclusterpolicy is cluster.
kubectl edit npuclusterpolicy cluster

After editing, Kubernetes automatically applies the updates to the cluster. All components managed by NPU Operator will also be updated to the expected state.
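
For scripted changes, kubectl patch can replace the interactive edit. The sketch below assumes that each component exposes a managed field under spec, as in the NPUClusterPolicy example in this document, and that toggling it is how a component is handed over to or taken away from the Operator. The function name managed_patch is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a JSON merge patch that sets spec.<component>.managed.
# $1 is the component key in the spec (for example exporter), $2 is true or false.
managed_patch() {
  local component="$1" managed="$2"
  case "$managed" in
    true|false) ;;
    *) echo "managed must be true or false" >&2; return 1 ;;
  esac
  printf '{"spec":{"%s":{"managed":%s}}}\n' "$component" "$managed"
}

# Usage (requires cluster access; not executed here):
#   kubectl patch npuclusterpolicy cluster --type merge -p "$(managed_patch exporter false)"
```

A merge patch is used because custom resources do not support strategic merge patches.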

Installation Status Verification

View Component Status via CustomResource

View component status through the CustomResource npuclusterpolicies.npu.openfuyao.com: check the state field of each component under the status field to confirm the component's current status. The following example shows the driver installer running normally.

yaml
status:
  componentStatuses:
    - name: /var/lib/npu-operator/components/driver
      prevState:
        reason: Reconciling
        type: deploying
      state:
        reason: Reconciled
        type: running
  • View CustomResource
bash
$ kubectl get npuclusterpolicies.npu.openfuyao.com cluster -o yaml
apiVersion: npu.openfuyao.com/v1
kind: NPUClusterPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: npu
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2025-03-11T13:22:39Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: npu-operator
  name: cluster
  resourceVersion: "2240086"
  uid: 0d1498c5-143a-4e05-a5dc-376d2e6c96ea
spec:
  clusterd:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/clusterd
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/clusterd/clusterd.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  daemonsets:
    imageSpec:
      imagePullPolicy: IfNotPresent
      imagePullSecrets: []
    labels:
      app.kubernetes.io/managed-by: npu-operator
      helm.sh/chart: npu-operator-0.0.0-latest
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Equal
      value: ""
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
      operator: Equal
      value: ""
  devicePlugin:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/ascend-k8sdeviceplugin
      tag: v6.0.0
    initImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: hub.oepkgs.net/busybox:latest
      tag: ""
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/devicePlugin/devicePlugin.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  driver:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-driver-installer
      tag: latest
    initImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: cr.openfuyao.cn/openfuyao/npu-driver-installer:latest
      tag: ""
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/driver/driver.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
    version: 24.1.RC3
  exporter:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/npu-exporter
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/npu-exporter/npu-exporter.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  mindioacp:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-node-provision
      tag: latest
    managed: false
    version: 6.0.0
  mindiotft:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-node-provision
      tag: latest
    managed: false
  nodeD:
    heartbeatInterval: 5
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/noded
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/noded/noded.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
    pollInterval: 60
  ociRuntime:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/npu-container-toolkit
      tag: latest
    initConfigImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: cr.openfuyao.cn/openfuyao/npu-container-toolkit:latest
      tag: ""
    initRuntimeImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: cr.openfuyao.cn/openfuyao/ascend-image/ascend-docker-runtime:latest
      tag: ""
    interval: 300
    managed: true
  operator:
    imageSpec:
      imagePullPolicy: IfNotPresent
      imagePullSecrets: []
    runtimeClass: ascend
  rscontroller:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/resilience-controller
      tag: v6.0.0
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/resilience-controller/run.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  trainer:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/ascend-operator
      tag: v6.0.0
    initImageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: ""
      repository: hub.oepkgs.net/library/busybox:latest
      tag: ""
    logRotate:
      compress: false
      logFile: /var/log/mindx-dl/ascend-operator/ascend-operator.log
      logLevel: info
      maxAge: 7
      rotate: 30
    managed: true
  vccontroller:
    controllerResources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 1000m
        memory: 1Gi
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/vc-controller-manager
      tag: v1.9.0-v6.0.0
    managed: true
  vcscheduler:
    imageSpec:
      imagePullPolicy: Always
      imagePullSecrets: []
      registry: cr.openfuyao.cn
      repository: openfuyao/ascend-image/vc-scheduler
      tag: v1.9.0-v6.0.0
    managed: true
    schedulerResources:
      limits:
        cpu: 200m
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 1Gi
status:
  componentStatuses:
  - name: /var/lib/npu-operator/components/driver
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/oci-runtime
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/device-plugin
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/trainer
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/noded
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/volcano/volcano-controller
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/volcano/volcano-scheduler
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/clusterd
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/resilience-controller
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/npu-exporter
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: Reconciled
      type: running
  - name: /var/lib/npu-operator/components/mindio/mindiotft
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: ComponentUnmanaged
      type: unmanaged
  - name: /var/lib/npu-operator/components/mindio/mindioacp
    prevState:
      reason: Reconciling
      type: deploying
    state:
      reason: ComponentUnmanaged
      type: unmanaged
  conditions:
  - lastTransitionTime: "2025-03-11T13:25:41Z"
    message: ""
    reason: Ready
    status: "False"
    type: Error
  - lastTransitionTime: "2025-03-11T13:25:41Z"
    message: all components have been successfully reconciled
    reason: Reconciled
    status: "True"
    type: Ready
  namespace: default
  phase: Ready
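
To summarize the componentStatuses list without reading the whole object, the state of each component can be extracted with a JSONPath query and filtered. This is a sketch under assumptions: the function name unhealthy_components is hypothetical, and it treats running and unmanaged as the only states that need no attention, based on the example output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<name><TAB><state type>" lines on stdin and prints
# every component whose state is neither "running" nor "unmanaged".
# Returns non-zero if at least one such component is found.
unhealthy_components() {
  awk -F '\t' '
    NF < 2 { next }
    $2 != "running" && $2 != "unmanaged" { print $1 ": " $2; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage (requires cluster access; not executed here):
#   kubectl get npuclusterpolicies.npu.openfuyao.com cluster \
#     -o jsonpath='{range .status.componentStatuses[*]}{.name}{"\t"}{.state.type}{"\n"}{end}' \
#     | unhealthy_components
```

An empty result with a zero exit status means every listed component is either running or deliberately unmanaged.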

Manually Verify Installation Status and Running Results of Each Component

  • Driver Installation Status Verification

    To check the driver firmware installation status, run a command such as npu-smi info. Output similar to the following indicates that the driver has been installed.

    shell
    +------------------------------------------------------------------------------------------------+
    | npu-smi 24.1.rc2                 Version: 24.1.rc2                                             |
    +---------------------------+---------------+----------------------------------------------------+
    | NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
    | Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
    +===========================+===============+====================================================+
    | 0     910B3               | OK            | 99.1        55                0    / 0             |
    | 0                         | 0000:C1:00.0  | 0           0    / 0          3162 / 65536         |
    +===========================+===============+====================================================+
    | 1     910B3               | OK            | 91.7        53                0    / 0             |
    | 0                         | 0000:C2:00.0  | 0           0    / 0          3162 / 65536         |
    +===========================+===============+====================================================+
    | 2     910B3               | OK            | 98.2        51                0    / 0             |
    | 0                         | 0000:81:00.0  | 0           0    / 0          3162 / 65536         |
    +===========================+===============+====================================================+
    | 3     910B3               | OK            | 93.2        49                0    / 0             |
    | 0                         | 0000:82:00.0  | 0           0    / 0          3162 / 65536         |
    +===========================+===============+====================================================+
    | 4     910B3               | OK            | 98.8        55                0    / 0             |
    | 0                         | 0000:01:00.0  | 0           0    / 0          3163 / 65536         |
    +===========================+===============+====================================================+
    | 5     910B3               | OK            | 96.2        56                0    / 0             |
    | 0                         | 0000:02:00.0  | 0           0    / 0          3163 / 65536         |
    +===========================+===============+====================================================+
    | 6     910B3               | OK            | 96.9        53                0    / 0             |
    | 0                         | 0000:41:00.0  | 0           0    / 0          3162 / 65536         |
    +===========================+===============+====================================================+
    | 7     910B3               | OK            | 97.6        55                0    / 0             |
    | 0                         | 0000:42:00.0  | 0           0    / 0          3163 / 65536         |
    +===========================+===============+====================================================+
    +---------------------------+---------------+----------------------------------------------------+
    | NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
    +===========================+===============+====================================================+
    | No running processes found in NPU 0                                                            |
    +===========================+===============+====================================================+
    | No running processes found in NPU 1                                                            |
    +===========================+===============+====================================================+
    | No running processes found in NPU 2                                                            |
    +===========================+===============+====================================================+
    | No running processes found in NPU 3                                                            |
    +===========================+===============+====================================================+
    | No running processes found in NPU 4                                                            |
    +===========================+===============+====================================================+
    | No running processes found in NPU 5                                                            |
    +===========================+===============+====================================================+
    | No running processes found in NPU 6                                                            |
    +===========================+===============+====================================================+
    | No running processes found in NPU 7                                                            |
    +===========================+===============+====================================================+
  • MindCluster Component Installation Status Verification

    Run kubectl get pod -A to list all Pods. If every Pod is in the Running state, the components have started successfully. For more detailed functional verification of each component, refer to the MindCluster official documentation.

    bash

    NAMESPACE                NAME                                                      READY   STATUS    RESTARTS         AGE
    default                  ascend-runtime-containerd-7lg85                           1/1     Running   0                6m31s
    default                  npu-driver-c4744                                          1/1     Running   0                6m31s
    default                  npu-operator-77f56c9f6c-fhx8m                             1/1     Running   0                6m32s
    default                  npu-feature-discovery-zqgt9                               1/1     Running   0                7m12s
    default                  mindio-acp-43f64g63d2v                                    1/1     Running   0                7m21s
    default                  mindio-tft-2cc35gs3c2u                                    1/1     Running   0                6m32s
    kube-system              ascend-device-plugin-fm4h9                                1/1     Running   0                6m35s
    mindx-dl                 ascend-operator-manager-6ff7468bd9-47d7s                  1/1     Running   0                6m50s
    mindx-dl                 clusterd-5ffb8f6787-n5m82                                 1/1     Running   0                6m48s
    mindx-dl                 noded-kmv8d                                               1/1     Running   0                7m11s
    mindx-dl                 resilience-controller-6727f36c28-wjn3s                    1/1     Running   0                7m20s
    npu-exporter             npu-exporter-b6txl                                        1/1     Running   0                7m22s
    volcano-system           volcano-controllers-373749bg23c-mc9cq                     1/1     Running   0                7m31s
    volcano-system           volcano-scheduler-d585db88f-nkxch                         1/1     Running   0                7m40s
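
The two manual checks above can also be scripted. The following is a minimal sketch and not part of NPU Operator: the function names pods_not_running and npus_not_ok are hypothetical, and the parsing assumes the column layouts shown in the sample outputs above. Both functions read command output on standard input and print nothing when everything looks healthy.

```shell
# Print "<namespace>/<name> <status>" for every Pod whose STATUS is not Running.
# Expects `kubectl get pod -A` output on stdin (STATUS is the 4th column).
pods_not_running() {
  awk 'NR > 1 && NF >= 4 && $4 != "Running" { print $1 "/" $2, $4 }'
}

# Print the ID of every NPU whose Health cell is not "OK".
# Expects `npu-smi info` output on stdin, in the table layout shown above.
npus_not_ok() {
  awk -F'|' '
    $3 ~ /Process id/ { exit }            # stop before the process table
    NF >= 5 {
      n = split($2, cell, " ")            # NPU rows hold "<id> <name>"
      health = $3
      gsub(/[ \t]/, "", health)
      if (n == 2 && cell[1] ~ /^[0-9]+$/ && health != "OK") print cell[1]
    }'
}
```

Typical usage would be kubectl get pod -A | pods_not_running and npu-smi info | npus_not_ok. Note that Pods in any status other than Running, including Completed, are listed by the first function.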

Note:

The ascend-docker-runtime is installed as a plugin and registered with containerd. To use this feature, specify the runtime as ascend-docker-runtime when starting a container, or specify the runtimeClassName as ascend when creating Kubernetes resources. Example:

ctr run --runtime io.containerd.runc.v2 --runc-binary /var/lib/npu-container-toolkit/runtime/ascend-docker-runtime -t \
    --env ASCEND_VISIBLE_DEVICES=0 ubuntu:22.04 <container_id>
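
For the Kubernetes path mentioned in the note, a Pod selects the runtime through runtimeClassName. The sketch below only illustrates that one field: the function name render_npu_test_pod, the Pod name, the image, and the command are placeholders, and it assumes a RuntimeClass named ascend exists in the cluster.

```shell
# Print a minimal Pod manifest that selects the Ascend runtime.
# Pod name, image, and command are illustrative placeholders.
render_npu_test_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: npu-runtime-test
spec:
  runtimeClassName: ascend   # value named in the note above
  restartPolicy: Never
  containers:
    - name: main
      image: ubuntu:22.04
      command: ["sleep", "3600"]
EOF
}
```

The manifest can be applied with render_npu_test_pod | kubectl apply -f -. A real workload would normally also request NPU resources through the device plugin; that part is omitted here because the resource name depends on the chip model.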

Uninstallation

Perform the following steps to uninstall the Operator.

  • Run the following command to delete the Operator through the Helm CLI, or delete it through the application management interface.

    shell
    helm delete <npu-operator release name>

  • By default, Helm does not delete existing CRDs when a chart is deleted. Run the following command to check whether the CRD still exists.

    shell
    kubectl get crd npuclusterpolicies.npu.openfuyao.com

  • Run the following command to manually delete the CRD.

    shell
    kubectl delete crd npuclusterpolicies.npu.openfuyao.com

Note:
After Operator uninstallation, the driver may still exist on the host machine.
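
The uninstallation steps can be combined into one helper. This is a sketch under assumptions: the function name uninstall_npu_operator is hypothetical, the release is assumed to live in the current namespace, and only the commands already shown in this section are used.

```shell
# Delete the Helm release, then remove the CRD if Helm left it behind.
# $1: name of the npu-operator Helm release.
uninstall_npu_operator() {
  release="$1"
  crd="npuclusterpolicies.npu.openfuyao.com"

  helm delete "$release" || return 1

  # Helm does not delete CRDs by default, so check before deleting.
  if kubectl get crd "$crd" >/dev/null 2>&1; then
    kubectl delete crd "$crd"
  else
    echo "CRD $crd not found, nothing to delete"
  fi
}
```

Call it as uninstall_npu_operator <npu-operator release name>. Deleting the CRD also removes any remaining custom resources of that type, so run it only when the Operator is no longer needed.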

Component Installation and Uninstallation Instructions

Component Installation and Uninstallation Fields

  • When NPU Operator is installed for the first time, if a component's enabled field in values.yaml is set to true, the component resources managed by NPU Operator replace any existing ones, regardless of whether the component already existed in the cluster.

  • If the component already exists in the cluster environment (e.g., volcano-controller), and the component's enabled field in values.yaml is set to false during the first installation of NPU Operator, the existing component resources in the cluster will not be deleted.

  • After NPU Operator is installed, you can modify the corresponding fields of the CR instance to change a component's image address and resource configuration and to manage its lifecycle.

MindIO Installation Dependencies

  • To install MindIO-related components, first install a Python environment (including the pip3 tool) on the node. Supported Python versions are 3.7 to 3.11; with any other version, the installation cannot complete normally. After installation, users can mount the corresponding SDK into the training container.

  • The MindIO TFT (Training Fault Tolerance) component is installed in /opt/sdk/tft. Whl packages for different Python versions are provided in /opt/tft-whl-package for customized needs. For usage details, refer to Fault Recovery Acceleration.

  • The MindIO ACP (Async Checkpoint Persistence) component is installed in /opt/mindio and /opt/sdk/acp. Whl packages for different Python versions are provided in /opt/acp-whl-package for users to install as needed. For usage details, refer to Checkpoint Save and Load Optimization.

  • When uninstalling MindIO components, the SDK folders related to MindIO components will be cleared, which may cause exceptions in task containers using the service. Please operate with caution.
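
The Python prerequisite above can be checked on a node before enabling the MindIO components. The following is a minimal sketch: the function names python_version_supported and check_mindio_prereqs are hypothetical, and the check only covers what this section states (python3 and pip3 present, version 3.7 to 3.11).

```shell
# Succeed if a "major.minor[.patch]" version string is within 3.7 to 3.11.
python_version_supported() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  [ "$major" = "3" ] && [ "$minor" -ge 7 ] 2>/dev/null && [ "$minor" -le 11 ]
}

# Check that python3 and pip3 exist and that the version is supported.
check_mindio_prereqs() {
  command -v python3 >/dev/null 2>&1 || { echo "python3 not found"; return 1; }
  command -v pip3 >/dev/null 2>&1 || { echo "pip3 not found"; return 1; }
  ver=$(python3 -c 'import platform; print(platform.python_version())')
  python_version_supported "$ver" || { echo "unsupported Python $ver"; return 1; }
  echo "Python $ver and pip3 found"
}
```

Running check_mindio_prereqs on each target node gives a quick pass or fail result before installation.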

Helm chart values.yaml Special Field Instructions

  • The driver.env field adds container environment variables for npu-driver-installer. "HOST_DRIVER_SOURCE_PATH" is the path where the driver and firmware zip package must be placed for offline installation; the current default is "/tmp/driver_pkg". "DRIVER_VERSION" is the driver and firmware version number; the default is "25.3.RC1".
  • The trainer.commandSpec field (using the ascend-operator component as an example) configures the startup command of the component container. You can modify it to set, for example, the log level, log path, and component startup parameters. For details, refer to the parameter descriptions of each component in the MindCluster documentation. The ascend-device-plugin, npu-exporter, volcano, clusterd, noded, and other components also contain this field and can be configured with different startup parameters.
  • The trainer.resources field (using the ascend-operator component as an example) configures the resource requests of the component container. Unless you have special requirements, keep the default configuration; you can adjust it according to your workloads and available cluster resources. The ascend-device-plugin, npu-exporter, volcano, clusterd, noded, and other components also contain this field.
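
For offline driver installation, it can help to confirm that a package is actually present at the path given by "HOST_DRIVER_SOURCE_PATH" before installing. The sketch below is an illustration only: the function name driver_pkg_present is hypothetical, and it assumes the path refers to a directory on the node's host filesystem, as the variable name suggests.

```shell
# Succeed if the directory contains at least one .zip file.
# $1: directory to check; defaults to the documented default path.
driver_pkg_present() {
  dir=${1:-/tmp/driver_pkg}
  for f in "$dir"/*.zip; do
    # If nothing matches, the glob stays literal and -f fails.
    [ -f "$f" ] && return 0
  done
  echo "no .zip package found in $dir" >&2
  return 1
}
```

Run driver_pkg_present on each NPU node, passing the directory explicitly if the default path was changed in driver.env.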