Version: v25.09

AI Inference Software Suite

Feature Overview

The openFuyao AI inference software suite enables users to run foundation model inference on appliances. It is optimized for Kunpeng and Ascend hardware while remaining compatible with mainstream GPU-based computing scenarios.

Application Scenarios

Currently, enterprises primarily use AI inference in web and API scenarios.

  • Web scenario: Users interact with foundation models by typing text directly into the input box on a web page provided by services such as Doubao and DeepSeek.
  • API scenario: Users interact with foundation models programmatically by calling APIs. This scenario is common in application plug-ins that embed foundation models into existing software, such as Cline and Roo Code.

Supported Capabilities

The AI inference software suite provides one-click installation, deployment, and model inference capabilities. The main functions are as follows:

  • One-click installation and deployment: Supports one-click deployment of openFuyao and foundation model inference components, lowering the deployment threshold.
  • Foundation model APIs: Provides APIs compatible with mainstream products, allowing you to call foundation model capabilities through these APIs (a minimal client example follows this list).
  • Monitoring: Supports integration of a monitoring dashboard component, allowing users to observe indicators of the inference service.
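
Because these APIs follow the OpenAI-compatible format described later in "Submitting an Inference Job Using OpenAI APIs", mainstream SDKs can be pointed directly at the inference service. The following is a minimal sketch using the openai Python SDK (v1.x); the service address, the fake-key placeholder, and the model name are assumptions and must be adapted to your deployment.

    # Minimal sketch: point the OpenAI Python SDK at the suite's OpenAI-compatible endpoint.
    # <cluster_ip> is the cluster IP of the inference service; no real API key is required.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://<cluster_ip>:8000/v1",
        api_key="fake-key",
    )

    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",  # the model configured at deployment
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)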

Highlights

The AI inference software suite integrates with and remains compatible with open-source models, enabling high scalability and quick installation and deployment. It also optimizes affinity for heterogeneous computing power to improve resource utilization.

  • Quick installation and out-of-the-box (lightweight and low noise) capabilities: The AI inference software suite (including the driver and firmware) can be quickly installed based on the openFuyao platform.
  • High scalability and quick customized deployment: The scalable architecture is implemented based on the openFuyao container platform. The appliance solution can be customized based on the openFuyao extension market and external open-source components.
  • Affinity optimization for heterogeneous computing power: Performs affinity optimization for Kunpeng + Ascend computing scenarios while remaining compatible with mainstream GPU-based computing scenarios.

Implementation Principles

Figure 1 Implementation principles of the AI inference software suite


  • Inference job: In a complete inference job, a user initiates an inference request through an OpenAI API, and the request is sent to a RayCluster in a Kubernetes cluster. After receiving the request, the RayCluster delivers the job to the head Pod and worker Pods in the cluster. Once the head Pod and worker Pods complete the inference job, they return the result along the original path. Figure 2 shows the inference job sequence diagram.

Figure 2 Inference job sequence diagram of the AI inference software suite


  • Cluster monitoring: Users or administrators can view the status of Ray clusters by using the visualized monitoring dashboard custom-monitor. custom-monitor obtains indicators from the monitoring component k8s-monitor in the cluster.
  • One-click installation and deployment: The AI inference software suite supports one-click deployment of the openFuyao management platform.

The AI inference software suite has two kinds of relationships with other features: unilateral dependency and joint usage.

  • Unilateral dependency

    Unilateral dependencies are the prerequisite features required for running the AI inference software suite, as shown in Table 1.

Table 1 Unilateral dependency

| Component | License | Version | Description |
| --- | --- | --- | --- |
| NPU Operator | MulanPSL-2.0 | 0.0.0-latest | openFuyao ecosystem component, which is a Kubernetes cluster management tool for Ascend NPUs. |
| GPU Operator | Apache-2.0 License | v25.3.2 | Ecosystem component provided by NVIDIA, which is a Kubernetes cluster management tool for NVIDIA GPUs. |
| KubeRay Operator | Apache-2.0 License | v1.4.0 | Simplified deployment and management of Ray applications in the Kubernetes cluster. |
| Prometheus | Apache-2.0 License | - | Kubernetes native monitoring component, which directly monitors cluster services and exposes indicators in a standard format for querying. |

NOTE
The GPU Operator is an optional third-party dependency and needs to be installed only when NVIDIA GPU inference is used. The GPU Operator is based on the official NVIDIA component, and openFuyao does not modify its content. For details about additional terms and restrictions when using related components, see the official NVIDIA documentation.

  • Joint usage

    Joint usage refers to features that are decoupled from the AI inference software suite. In the current version, users can select visualized monitoring features as required.

Table 2 Joint usage

| Component | License | Version | Description |
| --- | --- | --- | --- |
| monitoring-dashboard | MulanPSL-2.0 | 0.0.0-latest | openFuyao ecosystem component. Users can add indicators and customize the monitoring dashboard as required. |
| monitoring-service | MulanPSL-2.0 | 0.0.0-latest | openFuyao ecosystem component, which is used to monitor, collect, and analyze the status of resources and applications in a cluster in real time. |

Installation

One-click installation and deployment of the AI inference software suite refers to one-click deployment of the appliance from the openFuyao application market.

openFuyao Platform

  1. In the navigation pane of openFuyao, choose Application Market > Applications. The Applications page is displayed.
  2. Select AI/Machine learning in the Scenario area on the left and search for the aiaio-installer card. Alternatively, enter aiaio-installer in the search box.
  3. Click the aiaio-installer card to go to the Details page of the extension.
  4. Click Deploy. The Deploy page is displayed.
  5. Enter the application name and select the desired installation version and namespace. You can select an existing namespace or create a namespace. For details about how to create a namespace, see Namespace.
  6. On the Values.yaml tab page in the Parameter Configuration area, enter the values to be deployed. There are many parameters that can be configured for this component. For details, see Parameter Configuration Description.
  7. Click Deploy to complete the application installation.

NOTE
Create a directory based on the application name: on the server where the application is to be deployed, create the /mnt/<Application name>-storage directory. For example, if the application name is aiaio, create the /mnt/aiaio-storage directory.

Parameter Configuration Description

Table 3 Parameters in values.yaml

| Configuration Category | Parameter | Data Type | Description | Optional Value/Remarks |
| --- | --- | --- | --- | --- |
| accelerator | GPU | boolean | Whether to use GPUs for inference | true or false. Only one of GPU and NPU can be set to true. |
| accelerator | NPU | boolean | Whether to use NPUs for inference | true or false. Only one of GPU and NPU can be set to true. |
| accelerator | type | string | Hardware accelerator type | NPU: Ascend 910B and Ascend 910B4; GPU: V100 |
| accelerator | num | integer | Hardware accelerator quantity | Set this parameter to tensor_parallel_size × pipeline_parallel_size based on the model size and recommended configurations. |
| service | image | string | Inference service image path | Recommended NPU image: harbor.openfuyao.com/openfuyao/ai-all-in-one:latest. GPU images need to be built using build/Dockerfile.gpu. |
| service | model | string | Inference model name | Model path, for example, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B. |
| service | tensor_parallel_size | integer | Tensor parallelism size | Set this parameter based on the model size and recommended configurations. |
| service | pipeline_parallel_size | integer | Pipeline parallelism size | Set this parameter based on the model size and recommended configurations. |
| service | max_model_len | integer | Maximum sequence length of the model | Set this parameter based on the model size and hardware accelerator configuration. The recommended value is 16384; adjust it as required. |
| service | vllm_use_v1 | integer | Whether to use the vLLM v1 engine | The recommended value is 1. |
| storage | accessMode | string | Storage access mode | ReadWriteMany is recommended. |
| storage | size | string | Storage size | Set this parameter based on the model size and recommended configurations. The recommended value is 1.2 times the model size or larger (see the sizing sketch after this table). |
| storage | storageClassName | string | Storage class name | Set this parameter based on the actual storage class of the cluster. This parameter is ignored when default_pv is set to true. |
| storage | default_pv | boolean | Whether to use the default persistent volume | If no storage class is preconfigured in the cluster, you are advised to set this parameter to true. |
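
As a rough illustration of the "1.2 times the model size" sizing guideline, the following Python sketch estimates a storage floor from the parameter count. The assumption of about 2 bytes per parameter (FP16/BF16 weight files) is for illustration only; size the volume based on your actual model files.

    # Illustrative only: assumes ~2 bytes per parameter (FP16/BF16 weight files).
    def estimated_storage_gib(params_billion: float, headroom: float = 1.2) -> float:
        """Rough storage floor (GiB) following the '1.2 x model size or larger' guideline."""
        model_bytes = params_billion * 1e9 * 2        # approximate size of the weight files
        return model_bytes * headroom / (1024 ** 3)   # convert bytes to GiB

    print(f"{estimated_storage_gib(14):.0f} GiB")  # ~31 GiB; Table 4 below rounds this up to 40Gi
    print(f"{estimated_storage_gib(32):.0f} GiB")  # ~72 GiB; Table 4 below rounds this up to 80Gi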

Table 4 Recommended model configurations

| Model Size | service.model | accelerator.GPU | accelerator.NPU | accelerator.type | accelerator.num | service.tensor_parallel_size | service.pipeline_parallel_size | storage.size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.5B | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | false | true | Ascend910B4 | 1 | 1 | 1 | 10Gi |
| 7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | false | true | Ascend910B4 | 1 | 1 | 1 | 20Gi |
| 8B | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | false | true | Ascend910B4 | 1/2 | 1/2 | 1 | 25Gi |
| 14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | false | true | Ascend910B4 | 2 | 2 | 1 | 40Gi |
| 32B | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | false | true | Ascend910B4 | 4 | 4 | 1 | 80Gi |
| 70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | false | true | Ascend910B4 | 8/16 | 8/16 | 1 | 160Gi |

NOTE

  • Only the recommended NPU configurations are provided. Recommended GPU configurations are not provided because GPU support is at an early stage.
  • In the preceding table, 1/2 indicates that the minimum configuration is 1 and the recommended configuration is 2. The same applies to other values in this format (for example, 8/16).
  • For details about complete configurations, see Configuration Examples.

Deployment Views

Currently, the AI inference software suite consists of three components: the Ascend NPU management component NPU operator (or the GPU management component GPU operator), the openFuyao Ray management component KubeRay operator, and the cluster component RayCluster. Depending on the size of the deployed model, RayCluster contains one head Pod and may contain zero or more worker Pods. These components do not need to be deployed by users. The deployment program automatically starts the components. For the deployment views, see Figure 3 (containing only the head Pod) and Figure 4 (containing the head Pod and one worker Pod).

Figure 3 Deployment view 1 of the AI inference software suite


Figure 4 Deployment view 2 of the AI inference software suite


Submitting an Inference Job Using OpenAI APIs

Prerequisites

The AI inference software suite service has been deployed on the node.

Context

You can use OpenAI APIs to submit inference jobs and view inference results.

  • Submitting an inference job: Construct an OpenAI HTTP request and send the request to the inference service address.
  • Viewing the inference result: After completing the job, the inference service returns the result to the requesting address, where it can be displayed on the user's web UI.

Figure 5 Inference job


Constraints

Currently, only plain-text inference jobs are supported.

Procedure

  1. Construct an inference job: Construct an inference job on the client. The JSON structure of the inference job is as follows:

    {
      "method": "POST",
      "url": "http://localhost:8000/v1/chat/completions",
      "headers": {
        "Content-Type": "application/json",
        "Authorization": "Bearer fake-key"
      },
      "body": {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "messages": [
          {
            "role": "user",
            "content": "Hello!"
          }
        ]
      }
    }
  2. Obtain the cluster IP address of the RayService Kubernetes Service whose name ends with serve-svc, and construct the inference request Uniform Resource Identifier (URI): http://<cluster_ip>:8000/v1/chat/completions

  3. Use a command-line tool, such as curl, or a programming language, such as Python, to initiate an HTTP POST request (a minimal Python example follows this procedure).

  4. The inference service in the AI inference software suite handles the inference job and returns the result to the client along the original path.
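
The following is a minimal sketch of step 3 using Python's requests library. It mirrors the JSON structure from step 1; replace <cluster_ip> with the cluster IP obtained in step 2, and set the model name to the model deployed in your cluster.

    # Minimal sketch of an HTTP POST inference request, assuming the serve-svc cluster IP
    # obtained in step 2 and the model name configured for your deployment.
    import requests

    url = "http://<cluster_ip>:8000/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer fake-key",  # no real key is required
    }
    payload = {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "messages": [{"role": "user", "content": "Hello!"}],
    }

    response = requests.post(url, json=payload, headers=headers, timeout=300)
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])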

Appendixes

Configuration Examples

NPU Configuration Example (14B Model)

accelerator:
  GPU: false
  NPU: true
  type: "Ascend910B4"
  num: 2

service:
  image: cr.openfuyao.cn/openfuyao/ai-all-in-one:latest
  model: "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
  tensor_parallel_size: 2
  pipeline_parallel_size: 1
  max_model_len: 16384
  vllm_use_v1: 1

storage:
  accessMode: ReadWriteMany
  size: 40Gi
  storageClassName: example-sc
  default_pv: true

GPU Configuration Example (14B Model)

accelerator:
  GPU: true
  NPU: false
  type: "V100"
  num: 2

service:
  image: "your-custom-gpu-image:latest"  # Build this image using build/Dockerfile.gpu.
  model: "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
  tensor_parallel_size: 2
  pipeline_parallel_size: 1
  max_model_len: 16384
  vllm_use_v1: 0  # The V100 does not support the vLLM v1 engine.

storage:
  accessMode: ReadWriteMany
  size: 40Gi
  storageClassName: example-sc
  default_pv: true