PD Disaggregation

Feature Introduction

PD Disaggregation (Prefill-Decode Disaggregation) is a KVCache-centric disaggregated inference deployment approach. It deploys the Prefill and Decode stages in two independent clusters and uses CPU, DRAM, and SSD resources to implement hierarchical caching of KVCache. The goal is to maximize throughput while meeting latency-related Service Level Objectives (SLOs), relieving inference service performance bottlenecks, particularly in long-context input scenarios.

Application Scenarios

Use this feature to deploy the Prefill stage and the Decode stage separately, in independent clusters, with a KVCache-centric disaggregated inference framework.

Capability Scope

PD Disaggregation makes full use of CPU, DRAM, and SSD resources to improve throughput, and is especially effective in long-context input scenarios.

Highlights

Batch Integration:
In the original serial queue, the Prefill node fully executes one request before fetching the next one from the queue, regardless of the prompt length. When user inputs are short, this serial execution wastes a significant amount of NPU computing power. Batch integration instead consumes the queue according to user input length and actual load. For example, requests shorter than 1024 tokens are grouped so that one batch holds one or more requests (≥1). Merging requests into batches improves execution parallelism and overall throughput.
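
The queue consumption idea can be illustrated with a minimal sketch. It is not the actual scheduler code: the `Request` class, the `next_batch` function, and the `max_batch_tokens` budget (used here as a simple stand-in for "actual load") are hypothetical names introduced for illustration. Only the 1024-token threshold comes from the description above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int


def next_batch(queue, short_threshold=1024, max_batch_tokens=4096):
    """Pop the next Prefill batch from the front of the queue.

    A long request (>= short_threshold tokens) runs alone, as in the serial
    queue. Consecutive short requests are merged until the hypothetical
    max_batch_tokens budget would be exceeded.
    """
    if not queue:
        return []
    first = queue.popleft()
    batch = [first]
    if first.prompt_tokens >= short_threshold:
        return batch
    total = first.prompt_tokens
    while queue:
        candidate = queue[0]
        if candidate.prompt_tokens >= short_threshold:
            break  # keep long requests in their own batch
        if total + candidate.prompt_tokens > max_batch_tokens:
            break  # budget reached, leave the rest for the next batch
        batch.append(queue.popleft())
        total += candidate.prompt_tokens
    return batch


pending = deque([
    Request("a", 200),
    Request("b", 300),
    Request("c", 5000),
    Request("d", 100),
])
while pending:
    print([r.request_id for r in next_batch(pending)])
# prints ['a', 'b'], then ['c'], then ['d']
```

In this sketch the two short requests at the head of the queue share one Prefill pass, while the 5000-token request still runs on its own.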
PD Load Awareness:
A request is prefilled either on a remote Prefill worker or locally in the Decode engine, depending on two checks:
1. Prefill length: if the absolute prefill length after prefix cache hits is ≤ a preset threshold, the request is prefilled in the Decode engine. On one hand, a very short prefill can be computed efficiently in the Decode engine by carrying it as a chunked prefill request alongside ongoing decode requests. On the other hand, when the prefix cache hit is very long, prefill becomes memory-bound rather than compute-bound, so it is also computed more efficiently in the Decode engine.
2. Prefill queue depth: a request is sent for remote prefill only while the number of remote prefill requests in the prefill queue is < a preset threshold. A large number of queued requests indicates that the Prefill workers are lagging behind, so it is better to compute locally until more Prefill workers join.
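
The two load-awareness checks can be sketched as a small routing function. This is an illustration under stated assumptions, not the shipped implementation: `RoutingThresholds`, `should_prefill_remotely`, and both default threshold values are hypothetical, and the sketch reads "prefill length" as the prompt tokens that remain after subtracting the prefix cache hit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingThresholds:
    # Hypothetical defaults; real thresholds are deployment-specific.
    max_local_prefill_tokens: int = 512
    max_remote_queue_depth: int = 8


def should_prefill_remotely(prompt_tokens, prefix_cache_hit_tokens,
                            remote_queue_depth,
                            thresholds=RoutingThresholds()):
    """Return True to use a remote Prefill worker, False to prefill locally."""
    # Tokens that still need prefill once the prefix cache hit is subtracted.
    remaining = max(prompt_tokens - prefix_cache_hit_tokens, 0)

    # Short remaining prefill: cheaper to piggyback on the Decode engine.
    if remaining <= thresholds.max_local_prefill_tokens:
        return False

    # Prefill queue is backed up: Prefill workers are lagging, stay local.
    if remote_queue_depth >= thresholds.max_remote_queue_depth:
        return False

    return True


print(should_prefill_remotely(4000, 0, 0))     # long prompt, idle queue -> True
print(should_prefill_remotely(4000, 3800, 0))  # long prefix cache hit -> False
print(should_prefill_remotely(4000, 0, 8))     # queue at threshold -> False
```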

Implementation Principle

PD Disaggregation adopts a KVCache-centric disaggregated inference framework. The Prefill and Decode stages are deployed in two independent clusters, and resources are allocated according to the requirements of each stage, which balances low latency against high throughput.

Relationship with Related Features

Mooncake: Mooncake serves as the communication tool in PD Disaggregation, enabling rapid transfer of data such as KVCache between the Prefill and Decode stages.
vLLM: vLLM serves as the inference computation component for the Prefill or Decode stage in the PD Disaggregation architecture and provides the inference interfaces.

Installation

Prerequisites

Before installation, check shared storage mounting and parameter plane connectivity as follows.

Shared Storage Mounting
1. Run `ll /mnt` to check whether the shared storage is already mounted.
2. If it is not mounted, run `mount -t dpc /kdxf /mnt` to mount it.
Parameter Plane Connectivity
1. Check the up/down status of the network ports: `for i in {0..7}; do hccn_tool -i $i -link -g; done`. The result is normal if no port is down.
2. Check for network port flapping: `for i in {0..7}; do hccn_tool -i $i -link_stat -g; done`. The result is normal if no flapping is reported.
3. Check the TLS configuration: `for i in {0..7}; do hccn_tool -i $i -tls -g | grep "tls switch"; done`. The TLS switch is expected to be 0 on every port.
  If any port reports a non-zero value, set the TLS switch to 0: `for i in {0..7}; do hccn_tool -i $i -tls -s enable 0; done`.
4. Run the following commands to stop and disable the firewall.
```bash
systemctl stop firewalld
systemctl disable firewalld
```
5. Check parameter plane network connectivity by pinging the IP address of one card from every card in the cluster.
  5.1 On one server, run `cat /etc/hccn.conf` and record one device IP in the cluster environment. This IP is written as `<ip>` in the following steps.
  5.2 On every server currently in use, run `for i in {0..7}; do hccn_tool -i $i -ping -g address <ip>; done`.
  5.3 If every ping succeeds, parameter plane network connectivity is normal.

Installation Steps

Image Download
Download the Ubuntu 22.04 image mindie_dev-2.1.RC1.B152-800I-A2-py311-ubuntu22.04-aarch64.tar.gz and upload it to the server.

Start Container

2.1 In the directory that contains the image, run the following command to load the image.

```bash
docker load -i mindie-XXX.tar.gz
```

2.2 Run the following command to view the ID of the loaded image.

```bash
docker images
```

2.3 Run the following command to start the container.

```bash
docker run --name openfuyao_pd -it -d --net=host --shm-size=500g \
--privileged=true \
-w /home \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device=/dev/devmm_svm \
--entrypoint=bash \
-v /mnt:/mnt \
-v /data:/data \
-v /dev:/dev \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /home:/home \
-v /tmp:/tmp \
-v /opt:/opt \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
-v XXX \
bc6711a1ef08 # Replace with the image ID shown by docker images
```

Git Clone

3.1 Run the following command to enter the container.

```bash
# Replace openfuyao_pd with the actual container ID or name (see docker ps)
docker exec -it openfuyao_pd bash
```

3.2 Run the following commands to configure the proxy and clone the repositories.

```bash
# Network configuration
export http_proxy=http://[account]:[password]@proxyhk.huawei.com:8080
export https_proxy=http://[account]:[password]@proxyhk.huawei.com:8080
export no_proxy=127.0.0.1,localhost,local,.local
# Disables TLS certificate verification for git
git config --global http.sslVerify false
# Download Mooncake
git clone https://github.com/AscendTransport/Mooncake
# Download vLLM
git clone https://github.com/vllm-project/vllm
# Download vllm-ascend
git clone https://gitcode.com/openFuyao/vllm-ascend
```

Mooncake Installation
In the cloned folder, open the script Mooncake/scripts/ascend/dependencies_ascend.sh, find the following two lines that install the whl package, and comment them out to avoid a duplicate installation.
```text
pip install mooncake-wheel/dist/*.whl --force
echo -e "Mooncake wheel pipinstall successfully."
```
Then run the installation script.
```bash
cd Mooncake/scripts/ascend
# First comment out the two whl installation lines shown above; installing the wheel twice causes errors
bash dependencies_ascend.sh
```
vLLM Installation
Run the following commands to enter the vllm directory and install vLLM.
```bash
cd vllm
VLLM_TARGET_DEVICE=empty pip install -v -e .
```
vllm-ascend Installation
Enter the vllm-ascend directory, change the torch-npu version in both requirements.txt and pyproject.toml to torch-npu==2.7.1rc1, then run the following installation commands.
```bash
cd vllm-ascend/
# Edit requirements.txt and pyproject.toml first (torch-npu==2.7.1rc1)
COMPILE_CUSTOM_KERNELS=0 pip install -v -e ./
```
Check whether the installation succeeded with the following command.
```bash
pip list | grep vllm
```
If the installation succeeded, the output contains two lines similar to the following.
```text
vllm                              0.10.2rc2.dev7+g7c8271cd1.empty
vllm_ascend                       0.1.dev789+g9b418ede4.d20250902
```
torch-npu Installation
Download the matching torch_npu package, torch_npu-2.7.1.dev20250724-cp311-cp311-manylinux_2_28_aarch64.whl, and upload it to the server.
Run the following installation command in the directory that contains the package.
```bash
pip install --no-index --no-deps --force-reinstall torch_npu*.whl
```

Using PD Disaggregation to Launch Instances

Prerequisites

Mooncake, vLLM, and vllm-ascend have been installed in the container.

Background Information

PD Disaggregation deploys the Prefill and Decode stages in two independent clusters to achieve full resource utilization and throughput improvement.

Usage Restrictions

During launch, the ports used by the Prefill and Decode instances must not conflict, and the tp_size of the Prefill instance must not be less than the tp_size of the Decode instance.
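
Both restrictions can be checked before launching anything. The following is a minimal sketch, assuming all instances run on the same host so that any repeated port is a conflict. `InstancePlan` and `check_launch_plan` are hypothetical helpers, not part of vLLM or vllm-ascend. The port numbers are the serving ports and `kv_port` values used in the example scripts of this guide.

```python
from dataclasses import dataclass


@dataclass
class InstancePlan:
    name: str
    role: str      # "prefill" or "decode"
    tp_size: int
    ports: tuple   # ports this instance binds on the host


def check_launch_plan(instances):
    """Return a list of problems; an empty list means the plan passes."""
    problems = []

    # Restriction 1: no port may be claimed by two different instances.
    owner_by_port = {}
    for inst in instances:
        for port in inst.ports:
            owner = owner_by_port.setdefault(port, inst.name)
            if owner != inst.name:
                problems.append(
                    f"port {port} is used by both {owner} and {inst.name}"
                )

    # Restriction 2: Prefill tp_size must not be less than Decode tp_size.
    prefill_tp = [i.tp_size for i in instances if i.role == "prefill"]
    decode_tp = [i.tp_size for i in instances if i.role == "decode"]
    if prefill_tp and decode_tp and min(prefill_tp) < max(decode_tp):
        problems.append(
            f"prefill tp_size {min(prefill_tp)} is less than "
            f"decode tp_size {max(decode_tp)}"
        )
    return problems


plan = [
    InstancePlan("prefill-0", "prefill", tp_size=2, ports=(8107, 20001)),
    InstancePlan("decode-0", "decode", tp_size=2, ports=(8207, 20002)),
]
print(check_launch_plan(plan))  # prints []
```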

Operation Steps

PD Disaggregation Launch

1.1 Prepare the scripts run_prefill.sh and run_decode.sh. Replace the model path ./Qwen3-8B with the actual path.

run_prefill.sh

```bash
# HCCL timeouts and network interface settings
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export HCCL_IF_IP=0.0.0.0
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
# NPUs used by the Prefill instance
export ASCEND_RT_VISIBLE_DEVICES=0,1

export VLLM_USE_V1=1
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

echo "HCCL_IF_IP=$HCCL_IF_IP"

# Prefill instance: kv_role is kv_producer
vllm serve ./Qwen3-8B \
  --host 0.0.0.0 \
  --port 8107 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --max-model-len 2000 \
  --max-num-batched-tokens 2000 \
  --trust-remote-code \
  --enforce-eager \
  --data-parallel-size 1 \
  --data-parallel-address localhost \
  --data-parallel-rpc-port 9100 \
  --gpu-memory-utilization 0.8 \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnectorV1",
  "kv_buffer_device": "npu",
  "kv_role": "kv_producer",
  "kv_parallel_size": 1,
  "kv_port": "20001",
  "engine_id": "0",
  "kv_rank": 0,
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
      "prefill": {
          "dp_size": 1,
          "tp_size": 2
      },
      "decode": {
          "dp_size": 1,
          "tp_size": 2
      }
  }
  }'
```

run_decode.sh

```bash
# HCCL timeouts and network interface settings
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export HCCL_IF_IP=0.0.0.0
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
# NPUs used by the Decode instance
export ASCEND_RT_VISIBLE_DEVICES=2,3

export VLLM_USE_V1=1
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH

echo "HCCL_IF_IP=$HCCL_IF_IP"

# Decode instance: kv_role is kv_consumer
vllm serve ./Qwen3-8B \
  --host 0.0.0.0 \
  --port 8207 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --max-model-len 2000 \
  --max-num-batched-tokens 2000 \
  --trust-remote-code \
  --enforce-eager \
  --data-parallel-size 1 \
  --data-parallel-address localhost \
  --data-parallel-rpc-port 9100 \
  --gpu-memory-utilization 0.8 \
  --no-enable-prefix-caching \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnectorV1",
  "kv_buffer_device": "npu",
  "kv_role": "kv_consumer",
  "kv_parallel_size": 1,
  "kv_port": "20002",
  "engine_id": "1",
  "kv_rank": 1,
  "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
  "kv_connector_extra_config": {
      "prefill": {
          "dp_size": 1,
          "tp_size": 2
      },
      "decode": {
          "dp_size": 1,
          "tp_size": 2
      }
  }
  }'
```
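
The `--kv-transfer-config` values in the two scripts differ only in `kv_role`, `kv_port`, `engine_id`, and `kv_rank`. One way to keep the shared fields in sync is to generate both JSON strings from a single template. The sketch below does that. `build_kv_transfer_config` is a hypothetical helper written for this guide, and every field value is copied from the scripts above.

```python
import json


def build_kv_transfer_config(role, kv_port, engine_id, kv_rank,
                             prefill_tp=2, decode_tp=2):
    """Build the kv-transfer-config dictionary for one instance."""
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unexpected kv_role: {role}")
    return {
        "kv_connector": "MooncakeConnectorV1",
        "kv_buffer_device": "npu",
        "kv_role": role,
        "kv_parallel_size": 1,
        "kv_port": str(kv_port),      # the scripts pass this as a string
        "engine_id": str(engine_id),  # the scripts pass this as a string
        "kv_rank": kv_rank,
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
        "kv_connector_extra_config": {
            "prefill": {"dp_size": 1, "tp_size": prefill_tp},
            "decode": {"dp_size": 1, "tp_size": decode_tp},
        },
    }


prefill_cfg = build_kv_transfer_config("kv_producer", 20001, 0, 0)
decode_cfg = build_kv_transfer_config("kv_consumer", 20002, 1, 1)

# Each string can be pasted after --kv-transfer-config, wrapped in single quotes.
print(json.dumps(prefill_cfg))
print(json.dumps(decode_cfg))
```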

1.2 Run the following commands to launch the Prefill and Decode instances.

```bash
bash run_prefill.sh
bash run_decode.sh
```

1.3 Open a new terminal and start the proxy service. The values of `--prefiller-hosts`, `--prefiller-ports`, `--decoder-hosts`, and `--decoder-ports` must match the settings in the launch scripts.

```bash
cd vllm-ascend/examples/disaggregated_prefill_v1/

python load_balance_proxy_server_example.py \
  --host 0.0.0.0 \
  --prefiller-hosts 127.0.0.1 \
  --prefiller-ports 8107 \
  --decoder-hosts 127.0.0.1 \
  --decoder-ports 8207 \
  --port 9090
```

Service Launch Verification
Open a new terminal for testing. Replace the model path ./Qwen3-8B with the actual path, and make sure the port in the request matches the port specified when starting the proxy service (9090 in this example).
```bash
curl -v -s http://127.0.0.1:9090/v1/completions -H "Content-Type: application/json" -d '{"model": "./Qwen3-8B","prompt": "Hello, who are you?","max_tokens": 256}'
```
If the service returns a completion response, the configuration is successful.
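
The same check can be scripted, for example as a smoke test after deployment. The sketch below uses only the Python standard library and assumes the proxy returns an OpenAI-compatible completions body, with the generated text under `choices[0].text`. The function names are hypothetical, and `verify_service` performs a real network call, so it only works while the proxy is running.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=256):
    """Build the same POST request that the curl command sends."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body):
    """Extract the generated text from a completions response body."""
    return json.loads(response_body)["choices"][0]["text"]


def verify_service(base_url="http://127.0.0.1:9090", model="./Qwen3-8B"):
    """Send one request to the proxy and return the generated text."""
    request = build_completion_request(base_url, model, "Hello, who are you?")
    with urllib.request.urlopen(request, timeout=60) as response:
        return first_completion_text(response.read())
```

Calling `verify_service()` from a Python shell inside the container returns the generated text when the Prefill instance, the Decode instance, and the proxy are all up. A connection error usually means one of them is not running or the port does not match.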

Follow-up Operations

To enable batch integration, add the following environment variable to the Prefill instance launch script.
```bash
export BATCH_BY_REQUEST_LENGTH="true"
```
Also add the following parameter to the vllm serve command in the Prefill instance launch script.
```bash
--scheduling-policy "priority"
```
To modify other configurations of the PD Disaggregation launch instances, refer to examples/disaggregated_prefill_v1/mooncake_connector_deployment_guide.md on the main branch of the vllm-project/vllm-ascend repository on GitHub.
