Version: v26.03

PD Disaggregation

Feature Introduction

PD Disaggregation (Prefill-Decode Disaggregation) is a disaggregated inference deployment approach centered on KVCache. It deploys the Prefill and Decode stages in two independent clusters and uses CPU, DRAM, and SSD resources to implement hierarchical caching of KVCache. The goal is to maximize throughput while meeting latency-related Service Level Objectives (SLOs), which removes inference service performance bottlenecks and is particularly effective for long-context inputs.

Application Scenarios

Deploying the Prefill stage and the Decode stage separately in independent clusters, using a disaggregated inference framework centered on KVCache.

Capability Scope

PD Disaggregation uses CPU, DRAM, and SSD resources to improve throughput, and is particularly effective in long-context input scenarios.

Highlights

  • Batch Integration:

    In the original serial queue, the Prefill node executed each request to completion before fetching the next one from the queue, regardless of the prompt length. When user inputs are short, this serial execution wastes a significant amount of NPU compute. Queue consumption is now batched according to user input length and actual load. For example, requests shorter than 1024 tokens can be merged so that a single batch holds several of them. Merging batches improves execution parallelism and overall throughput.

  • PD Load Awareness:

    A request's prefill is either sent to a remote prefill worker or computed locally in the decode engine, according to two conditions:

    1. The absolute prefill length, excluding prefix cache hits, compared with a preset threshold. If the remaining prefill length is at or below the threshold, it can be computed efficiently in the decode engine by piggybacking chunked prefill requests on ongoing decode requests. Likewise, if the prefix cache hit is very long, prefill becomes memory-bound rather than compute-bound, so it is also more efficient to compute it in the decode engine.
    2. The number of remote prefill requests in the prefill queue, compared with a preset threshold. Remote prefill is used only while the queue stays below the threshold. A large number of queued requests indicates that the prefill workers are lagging behind, so it is better to compute prefill locally until more prefill workers join.
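The two highlights above can be sketched as small shell helpers. This is a minimal illustration under stated assumptions, not the actual scheduler: the function names, the per-batch token budget, and the way thresholds are passed in are all hypothetical. Only the 1024-token example limit and the two routing conditions come from the text.

```bash
# Hypothetical sketch of batch integration.
# Reads prompt lengths (tokens) from stdin, one per line, in queue order.
# Requests shorter than $1 are merged into a batch until the batch would
# exceed the token budget $2 (the budget is an assumption of this sketch).
# Longer requests run alone. Prints one batch per line.
pack_short_requests() {
    awk -v limit="$1" -v budget="$2" '
        function emit() {
            if (n > 0) { print batch; batch = ""; n = 0; total = 0 }
        }
        {
            len = $1 + 0
            if (len >= limit) { emit(); print len; next }
            if (n > 0 && total + len > budget) { emit() }
            if (n > 0) { batch = batch " " len } else { batch = len }
            n++
            total += len
        }
        END { emit() }
    '
}

# Hypothetical sketch of PD load awareness.
# $1: prefill length excluding prefix cache hits (tokens)
# $2: number of remote prefill requests currently queued
# $3: preset length threshold, $4: preset queue threshold
# Prints "local" (prefill in the decode engine) or "remote".
choose_prefill_target() {
    prefill_len=$1
    queue_depth=$2
    max_local_len=$3
    max_queue_depth=$4

    # Short remaining prefill: piggyback on ongoing decode requests.
    if [ "$prefill_len" -le "$max_local_len" ]; then
        echo local
        return 0
    fi
    # Prefill queue is full: prefill workers are lagging, stay local.
    if [ "$queue_depth" -ge "$max_queue_depth" ]; then
        echo local
        return 0
    fi
    echo remote
}
```

For example, `printf '200\n300\n5000\n' | pack_short_requests 1024 1024` groups the two short requests into one batch and leaves the long one on its own, and `choose_prefill_target 4096 2 512 8` selects remote prefill because the request is long and the queue is not full.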

Implementation Principle

PD Disaggregation adopts a disaggregated inference framework centered on KVCache. It deploys the Prefill and Decode stages in two independent clusters and allocates resources according to the requirements of each stage, which balances low latency against high throughput.

  • Mooncake: Mooncake serves as a communication tool in the PD Disaggregation process, enabling rapid transmission of data such as KVCache between the Prefill and Decode stages.
  • vLLM: vLLM can serve as the inference computation component for the Prefill or Decode stage in the PD Disaggregation architecture, providing inference interfaces.

Installation

Prerequisites

Shared storage mounting and parameter plane connectivity need to be checked as follows.

  • Shared Storage Mounting

    1. Use ll /mnt to check if storage is automatically mounted;
    2. If not mounted, use mount -t dpc /kdxf /mnt to mount.
  • Parameter Plane Connectivity

    1. Check the up/down status of the network ports: for i in {0..7}; do hccn_tool -i $i -link -g; done. The result is normal if no port is down.

    2. Check for network port flapping: for i in {0..7}; do hccn_tool -i $i -link_stat -g; done. The result is normal if there is no flapping.

    3. Check the TLS configuration: for i in {0..7};do hccn_tool -i $i -tls -g|grep "tls switch";done. In the normal case, the tls switch value is 0 on every device.

      [Screenshot: output of the tls switch check]

      If any device reports a non-zero value, set tls to 0: for i in {0..7};do hccn_tool -i $i -tls -s enable 0;done.

    4. Execute the following command to disable the firewall.

      bash
      systemctl stop firewalld
      systemctl disable firewalld
    5. Check parameter plane network connectivity by pinging the IP address of one card from every card in the cluster.

      5.1 Execute cat /etc/hccn.conf on one server and record one device IP from the cluster environment. This IP is written as [ip] in the following steps.

      5.2 On every server currently in use, execute for i in {0..7}; do hccn_tool -i $i -ping -g address [ip]; done.

      5.3 If every ping succeeds, parameter plane network connectivity is normal.
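Recording the device IP in step 5.1 can be scripted. The sketch below assumes that /etc/hccn.conf stores each device address on a line of the form address_<device id>=<ip>. That layout is an assumption and should be checked against the actual file. The helper name device_ip is hypothetical.

```bash
# Hypothetical helper: print the IP configured for one device id.
# $1: device id (0..7), $2: config file (defaults to /etc/hccn.conf).
# Assumes lines of the form "address_<id>=<ip>"; prints nothing if the
# device id is not present.
device_ip() {
    dev=$1
    conf=${2:-/etc/hccn.conf}
    sed -n "s/^address_${dev}=//p" "$conf"
}

# Example use on a server (not executed here):
#   ip=$(device_ip 0)
#   for i in 0 1 2 3 4 5 6 7; do hccn_tool -i "$i" -ping -g address "$ip"; done
```

The ping loop in the comment is the same command as in step 5.2, with the recorded IP substituted from the helper's output.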

Installation Steps

  1. Image Download

    Download the Ubuntu 22.04 image mindie_dev-2.1.RC1.B152-800I-A2-py311-ubuntu22.04-aarch64.tar.gz and upload it to the server.

  2. Start Container

    2.1 In the directory where the image is located, execute the following command to load the image.

    bash
    docker load -i mindie-XXX.tar.gz

    2.2 Execute the following command to view the loaded image id.

    bash
    docker images

    2.3 Execute the following command to start the container.

    bash
    docker run --name openfuyao_pd  -it -d --net=host --shm-size=500g \
    --privileged=true \
    -w /home \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    --entrypoint=bash \
    -v /mnt:/mnt \
    -v /data:/data \
    -v /dev:/dev \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /home:/home \
    -v /tmp:/tmp \
    -v /opt:/opt \
    -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
    -v XXX \
    bc6711a1ef08 # Use the previously viewed image id here
  3. Git Clone

    3.1 Execute the following command to enter the container.

    bash
    docker exec -it openfuyao_pd bash  # Replace openfuyao_pd with the actual container id or container name, container information can be viewed with docker ps

    3.2 Execute the following commands to configure proxy and git clone.

    bash
    # Network configuration
    export http_proxy=http://[account]:[password]@proxyhk.huawei.com:8080
    export https_proxy=http://[account]:[password]@proxyhk.huawei.com:8080
    export no_proxy=127.0.0.1,localhost,local,.local
    git config --global http.sslVerify false
    # Download Mooncake
    git clone https://github.com/AscendTransport/Mooncake
    # Download vLLM
    git clone https://github.com/vllm-project/vllm
    # Download vllm-ascend
    git clone https://gitcode.com/openFuyao/vllm-ascend
  4. Mooncake Installation

    Open the Mooncake/scripts/ascend/dependencies_ascend.sh script in the cloned folder, find the following two lines that install the whl package, and comment them out to avoid a duplicate installation.

    text
    pip install mooncake-wheel/dist/*.whl --force
    echo -e "Mooncake wheel pipinstall successfully."

    Then run the installation script.

    bash
    cd Mooncake/scripts/ascend
    # Comment out the last two lines of whl package installation, duplicate installation will cause errors
    bash dependencies_ascend.sh
  5. vLLM Installation

    Execute the following command to enter the vllm directory and run the installation script.

    bash
    cd vllm
    VLLM_TARGET_DEVICE=empty pip install -v -e .
  6. vllm-ascend Installation

    Enter the vllm-ascend directory, change the torch-npu version in both requirements.txt and pyproject.toml to torch-npu==2.7.1rc1, then run the following installation command.

    bash
    cd vllm-ascend/
    # Modify requirements.txt, pyproject.toml
    COMPILE_CUSTOM_KERNELS=0 pip install -v -e ./
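The version change to requirements.txt and pyproject.toml can also be applied with sed instead of a manual edit. This is a sketch that assumes both files pin the dependency in the form torch-npu==<version>. If a file uses a different specifier, edit it by hand. The helper name pin_torch_npu is hypothetical.

```bash
# Hypothetical helper: rewrite every "torch-npu==<old>" pin in a file.
# $1: target version, $2: file to update.
# Writes to a temporary file and moves it into place so that the sketch
# does not depend on a particular "sed -i" flavor.
pin_torch_npu() {
    version=$1
    file=$2
    sed "s/torch-npu==[^\",[:space:]]*/torch-npu==${version}/g" "$file" \
        > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example use inside the vllm-ascend directory (not executed here):
#   pin_torch_npu 2.7.1rc1 requirements.txt
#   pin_torch_npu 2.7.1rc1 pyproject.toml
```

Run the helper before the pip install command so that the build picks up the new pin.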

    You can check if the installation was successful with the following command.

    bash
    pip list | grep vllm

    If installation is successful, there will be two lines of output similar to the following.

    text
    vllm                              0.10.2rc2.dev7+g7c8271cd1.empty
    vllm_ascend                       0.1.dev789+g9b418ede4.d20250902
  7. torch-npu Installation

    Download the matching torch_npu package, torch_npu-2.7.1.dev20250724-cp311-cp311-manylinux_2_28_aarch64.whl, and upload it to the server.

    Execute the following installation command in the corresponding directory.

    bash
    pip install --no-index --no-deps --force-reinstall torch_npu-2.7.1.dev20250724-cp311-cp311-manylinux_2_28_aarch64.whl

Using PD Disaggregation to Launch Instances

Prerequisites

Mooncake, vLLM, and vllm-ascend have been installed in the container.

Background Information

PD Disaggregation deploys the Prefill and Decode stages in two independent clusters to achieve full resource utilization and throughput improvement.

Usage Restrictions

During launch, the ports used by the Prefill and Decode instances must not conflict, and the tp_size of the Prefill instance should not be less than the tp_size of the Decode instance.
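Both restrictions can be checked before launching anything. The helper below is a hypothetical pre-flight check written for illustration. It is not part of vLLM or vllm-ascend, and it only compares the values passed to it.

```bash
# Hypothetical pre-flight check for the two usage restrictions.
# $1: prefill tp_size, $2: decode tp_size, remaining args: every port
# that the instances on this host will listen on.
# Returns 0 if both restrictions hold, 1 otherwise.
check_pd_launch() {
    prefill_tp=$1
    decode_tp=$2
    shift 2

    if [ "$prefill_tp" -lt "$decode_tp" ]; then
        echo "error: prefill tp_size ($prefill_tp) is less than decode tp_size ($decode_tp)" >&2
        return 1
    fi

    # uniq -d prints only values that occur more than once.
    dup=$(printf '%s\n' "$@" | sort | uniq -d)
    if [ -n "$dup" ]; then
        echo "error: port used more than once: $dup" >&2
        return 1
    fi
    return 0
}
```

For instance, `check_pd_launch 2 2 8107 8207 20001 20002 9090` succeeds, while passing 8107 twice or a prefill tp_size of 1 against a decode tp_size of 2 fails with a message on stderr.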

Operation Steps

  1. PD Disaggregation Launch

    1.1 Prepare the scripts run_prefill.sh and run_decode.sh. Replace the model file path ./Qwen3-8B with the actual file path.

    run_prefill.sh

    bash
    export HCCL_EXEC_TIMEOUT=204
    export HCCL_CONNECT_TIMEOUT=120
    export HCCL_IF_IP=0.0.0.0
    export GLOO_SOCKET_IFNAME="eth0"
    export TP_SOCKET_IFNAME="eth0"
    export HCCL_SOCKET_IFNAME="eth0"
    export ASCEND_RT_VISIBLE_DEVICES=0,1
    
    export VLLM_USE_V1=1
    export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
    
    echo "HCCL_IF_IP=$HCCL_IF_IP"
    
    vllm serve ./Qwen3-8B \
      --host 0.0.0.0 \
      --port 8107 \
      --tensor-parallel-size 2 \
      --seed 1024 \
      --max-model-len 2000 \
      --max-num-batched-tokens 2000 \
      --trust-remote-code \
      --enforce-eager \
      --data-parallel-size 1 \
      --data-parallel-address localhost \
      --data-parallel-rpc-port 9100 \
      --gpu-memory-utilization 0.8 \
      --kv-transfer-config \
      '{"kv_connector": "MooncakeConnectorV1",
      "kv_buffer_device": "npu",
      "kv_role": "kv_producer",
      "kv_parallel_size": 1,
      "kv_port": "20001",
      "engine_id": "0",
      "kv_rank": 0,
      "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
      "kv_connector_extra_config": {
                "prefill": {
                        "dp_size": 1,
                        "tp_size": 2
                 },
                 "decode": {
                        "dp_size": 1,
                        "tp_size": 2
                 }
          }
      }'

    run_decode.sh

    bash
    export HCCL_EXEC_TIMEOUT=204
    export HCCL_CONNECT_TIMEOUT=120
    export HCCL_IF_IP=0.0.0.0
    export GLOO_SOCKET_IFNAME="eth0"
    export TP_SOCKET_IFNAME="eth0"
    export HCCL_SOCKET_IFNAME="eth0"
    export ASCEND_RT_VISIBLE_DEVICES=2,3
    
    export VLLM_USE_V1=1
    export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
    
    
    echo "HCCL_IF_IP=$HCCL_IF_IP"
    
    
    vllm serve ./Qwen3-8B \
      --host 0.0.0.0 \
      --port 8207 \
      --tensor-parallel-size 2 \
      --seed 1024 \
      --max-model-len 2000 \
      --max-num-batched-tokens 2000 \
      --trust-remote-code \
      --enforce-eager \
      --data-parallel-size 1 \
      --data-parallel-address localhost \
      --data-parallel-rpc-port 9100 \
      --gpu-memory-utilization 0.8 \
      --no-enable-prefix-caching \
      --kv-transfer-config \
      '{"kv_connector": "MooncakeConnectorV1",
      "kv_buffer_device": "npu",
      "kv_role": "kv_consumer",
      "kv_parallel_size": 1,
      "kv_port": "20002",
      "engine_id": "1",
      "kv_rank": 1,
      "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
      "kv_connector_extra_config": {
                "prefill": {
                        "dp_size": 1,
                        "tp_size": 2
                 },
                 "decode": {
                        "dp_size": 1,
                        "tp_size": 2
                 }
          }
      }'
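Comparing the two scripts, the kv-transfer-config values differ only in kv_role, kv_port, engine_id, and kv_rank. The sketch below generates the JSON from those values so that the shared fields stay identical. The function name kv_transfer_config is hypothetical, and the field names and fixed values are copied from the scripts above rather than from a connector reference.

```bash
# Hypothetical helper: print the kv-transfer-config JSON for one instance.
# $1: kv_role (kv_producer or kv_consumer), $2: kv_port, $3: engine_id,
# $4: kv_rank, $5: prefill tp_size, $6: decode tp_size.
kv_transfer_config() {
    role=$1
    port=$2
    engine_id=$3
    rank=$4
    prefill_tp=$5
    decode_tp=$6
    cat <<EOF
{"kv_connector": "MooncakeConnectorV1",
 "kv_buffer_device": "npu",
 "kv_role": "${role}",
 "kv_parallel_size": 1,
 "kv_port": "${port}",
 "engine_id": "${engine_id}",
 "kv_rank": ${rank},
 "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
 "kv_connector_extra_config": {
   "prefill": {"dp_size": 1, "tp_size": ${prefill_tp}},
   "decode": {"dp_size": 1, "tp_size": ${decode_tp}}
 }}
EOF
}

# Example use in run_prefill.sh (not executed here):
#   vllm serve ./Qwen3-8B ... \
#     --kv-transfer-config "$(kv_transfer_config kv_producer 20001 0 0 2 2)"
```

The decode script would call the same helper with kv_consumer 20002 1 1 2 2, which reproduces the values shown in run_decode.sh.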

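    1.2 Execute the following commands to launch the two instances. Each command keeps running in the foreground while it serves requests, so run them in separate terminal sessions.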

    bash
    bash run_prefill.sh
    bash run_decode.sh
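Model loading can take some time, so it may help to wait until both instances respond before continuing. The retry helper below is a generic sketch. The name wait_until is hypothetical, and the commented usage assumes that the vLLM server answers on a /health endpoint and that curl is available in the container.

```bash
# Hypothetical helper: retry a command once per second until it succeeds
# or the timeout (in seconds) is reached.
# $1: timeout in seconds, remaining args: the command to run.
# Returns 0 as soon as the command succeeds, 1 on timeout.
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example use (not executed here; endpoint and ports are assumptions
# based on the scripts above):
#   wait_until 600 curl -sf http://127.0.0.1:8107/health
#   wait_until 600 curl -sf http://127.0.0.1:8207/health
```

If either call times out, check the output of the corresponding launch script before going further.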

    1.3 Open a new terminal session and start the proxy service. The values of prefiller-hosts, prefiller-ports, decoder-hosts, and decoder-ports must match the settings in the scripts.

    bash
    cd vllm-ascend/examples/disaggregated_prefill_v1/
    
    python load_balance_proxy_server_example.py --host 0.0.0.0 --prefiller-hosts 127.0.0.1 --prefiller-ports 8107 --decoder-hosts 127.0.0.1 --decoder-ports 8207 --port 9090
  2. Service Launch Verification

    Open a new terminal session for testing. Replace the model file path ./Qwen3-8B with the actual file path, and make sure the port in the request matches the port specified when starting the proxy service (9090 in this example).

    bash
    curl -v -s http://127.0.0.1:9090/v1/completions -H "Content-Type: application/json" -d '{"model": "./Qwen3-8B","prompt": "你好,你是谁?","max_tokens": 256}'

    A response similar to the following indicates that the configuration succeeded.

    [Screenshot: sample completion response]

Follow-up Operations