PD Disaggregation
Feature Introduction
PD Disaggregation (Prefill-Decode Disaggregation) is a disaggregated inference deployment approach centered on the KVCache. It deploys the Prefill and Decode stages in two independent clusters and uses CPU, DRAM, and SSD resources to implement hierarchical caching of the KVCache. The goal is to maximize throughput while meeting latency-related Service Level Objectives (SLOs) and to remove inference service performance bottlenecks. The approach is particularly effective for long-context inputs.
Application Scenarios
Scenarios that use a KVCache-centric disaggregated inference framework to deploy the Prefill stage and the Decode stage separately in independent clusters.
Capability Scope
PD Disaggregation can fully utilize CPU, DRAM, and SSD resources to improve throughput, and it is especially effective in long-context input scenarios.
Highlights
Batch Integration:
In the original serial queue, the Prefill node executes one request to completion before fetching the next request from the queue, regardless of the prompt length. When user inputs are short, this serial execution wastes a significant amount of NPU computing power. Batch integration adapts queue consumption to the user input length and the actual load. For example, requests shorter than 1024 tokens can be merged so that a single batch contains one or more requests. Merging batches improves execution parallelism and overall throughput.
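The merging idea can be illustrated with a toy sketch. It is not the scheduler used by vllm-ascend: the function name `group_by_length` and the `TOKEN_BUDGET` value are assumptions made for illustration, and only the 1024-token limit comes from the example above. The sketch groups consecutive short prompts into one batch until an assumed token budget is reached, while long prompts run alone.

```bash
#!/usr/bin/env bash
# Toy illustration only: group queued prompt lengths into batches.
# SHORT_LIMIT follows the 1024-token example; TOKEN_BUDGET is a hypothetical cap.
SHORT_LIMIT=1024     # prompts shorter than this may share a batch
TOKEN_BUDGET=2000    # assumed maximum total tokens per merged batch

# group_by_length LEN...
# Prints one batch per line, as a space-separated list of prompt lengths.
group_by_length() {
  local batch="" total=0 len
  for len in "$@"; do
    if (( len >= SHORT_LIMIT )); then
      # Long prompt: flush any pending short batch, then run it alone.
      [[ -n "$batch" ]] && echo "$batch"
      echo "$len"
      batch=""; total=0
    elif (( total + len > TOKEN_BUDGET )); then
      # Budget exceeded: flush and start a new batch with this prompt.
      echo "$batch"
      batch="$len"; total=$len
    else
      # Short prompt that still fits: merge it into the current batch.
      batch="${batch:+$batch }$len"
      total=$(( total + len ))
    fi
  done
  [[ -n "$batch" ]] && echo "$batch"
  return 0
}
```

For instance, `group_by_length 300 500 1500 200 900 1000` prints four batches: `300 500`, `1500`, `200 900`, and `1000`. A real scheduler would also take the current load into account, which this sketch omits.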
PD Load Awareness:
PD load awareness decides whether a request's prefill runs on a remote prefill worker or locally in the decode engine. Prefill is computed locally in either of the following cases:
- The absolute prefill length remaining after prefix cache hits ≤ a preset threshold. On one hand, if the request's prefill length is very short, it can be computed efficiently in the decode engine by piggybacking chunked prefill requests on ongoing decode requests. On the other hand, if the prefix cache hit is very long, prefill becomes memory-bound rather than compute-bound, so it can be computed more efficiently in the decode engine.
- The number of remote prefill requests in the prefill queue has reached a preset threshold (remote prefill is used only while the queue length is below it). A large number of queued requests indicates that the prefill workers are lagging behind, so it is better to compute locally until more prefill workers join.
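The two conditions can be sketched as a small decision function. This is only an illustration of the rule described above: the function name `choose_prefill_target` and both threshold values are hypothetical, and the actual routing logic and its configuration are not shown in this document.

```bash
#!/usr/bin/env bash
# Hypothetical thresholds for illustration; real values are deployment-specific.
MAX_LOCAL_PREFILL_LEN=512   # assumed token threshold for local prefill
MAX_PREFILL_QUEUE_SIZE=8    # assumed prefill queue length threshold

# choose_prefill_target PROMPT_LEN PREFIX_HIT_LEN QUEUE_SIZE
# Prints "local" (prefill in the decode engine) or "remote" (prefill worker).
choose_prefill_target() {
  local prompt_len=$1 prefix_hit_len=$2 queue_size=$3
  # Tokens that still need prefill after the prefix cache hit.
  local remaining=$(( prompt_len - prefix_hit_len ))
  if (( remaining <= MAX_LOCAL_PREFILL_LEN )); then
    echo "local"    # short effective prefill: piggyback on decode
  elif (( queue_size >= MAX_PREFILL_QUEUE_SIZE )); then
    echo "local"    # prefill workers are lagging behind
  else
    echo "remote"
  fi
}
```

With these assumed thresholds, `choose_prefill_target 4000 3800 0` prints `local` because only 200 tokens remain after the cache hit, while `choose_prefill_target 4000 0 2` prints `remote`.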
Implementation Principle
PD Disaggregation adopts a disaggregated inference framework centered on KVCache, deploying the Prefill and Decode stages in two independent clusters, allocating resources according to the resource requirements of both stages, ultimately achieving a balance between low latency and high throughput.
Relationship with Related Features
- Mooncake: Mooncake serves as a communication tool in the PD Disaggregation process, enabling rapid transmission of data such as KVCache between the Prefill and Decode stages.
- vLLM: vLLM can serve as the inference computation component for the Prefill or Decode stage in the PD Disaggregation architecture, providing inference interfaces.
Installation
Prerequisites
Shared storage mounting and parameter plane connectivity need to be checked as follows.
Shared Storage Mounting
- Use `ll /mnt` to check whether the storage is mounted automatically.
- If it is not mounted, use `mount -t dpc /kdxf /mnt` to mount it.
Parameter Plane Connectivity
1. Check the up/down status of the network ports: `for i in {0..7}; do hccn_tool -i $i -link -g; done`. The result is normal if no port is down.
2. Check for network port flapping: `for i in {0..7}; do hccn_tool -i $i -link_stat -g; done`. The result is normal if there is no flapping.
3. Check the TLS configuration: `for i in {0..7};do hccn_tool -i $i -tls -g|grep "tls switch";done`. Normally the TLS switch is 0 on every card. If any value is non-zero, run `for i in {0..7};do hccn_tool -i $i -tls -s enable 0;done` to set TLS to 0.
4. Execute the following commands to disable the firewall.

```bash
systemctl stop firewalld
systemctl disable firewalld
```

5. Check parameter plane network connectivity by pinging the IP address of one card from all cards in the cluster.

5.1 Execute `cat /etc/hccn.conf` on one server and record one device IP in the cluster environment. This IP is written as `[ip]` in the following steps.

5.2 On all servers currently in use, execute `for i in {0..7}; do hccn_tool -i $i -ping -g address [ip]; done`.

5.3 If every ping succeeds, parameter plane network connectivity is normal.
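Recording the device IP in step 5.1 can be scripted. The sketch below assumes that `/etc/hccn.conf` stores device IPs as lines of the form `address_<n>=<ip>`; check the file on your servers before relying on this, since the format is an assumption here. The helper names `list_device_ips` and `ping_device_ips` are hypothetical. The ping loop itself is the same `hccn_tool` command used in step 5.2.

```bash
#!/usr/bin/env bash
# Assumption: /etc/hccn.conf contains lines such as "address_0=10.0.0.1".

# list_device_ips [FILE]
# Prints one device IP per line, taken from the address_<n>= entries.
list_device_ips() {
  local conf="${1:-/etc/hccn.conf}"
  sed -n 's/^address_[0-9][0-9]*=\(.*\)$/\1/p' "$conf"
}

# ping_device_ips [FILE]
# Pings every listed device IP from all 8 local cards (step 5.2).
# Requires hccn_tool, so run it only on an Ascend server.
ping_device_ips() {
  local ip i
  for ip in $(list_device_ips "$1"); do
    for i in {0..7}; do
      hccn_tool -i "$i" -ping -g address "$ip"
    done
  done
}
```

Running `list_device_ips` on one server gives the candidate values for `[ip]`, and `ping_device_ips /path/to/copied/hccn.conf` on the other servers repeats step 5.2 for each of them.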
Installation Steps
1. Image Download

Download the Ubuntu 22.04 image `mindie_dev-2.1.RC1.B152-800I-A2-py311-ubuntu22.04-aarch64.tar.gz` and upload it to the server.

2. Start Container
2.1 In the directory where the image is located, execute the following command to load the image.

```bash
docker load -i mindie-XXX.tar.gz
```

2.2 Execute the following command to view the ID of the loaded image.

```bash
docker images
```

2.3 Execute the following command to start the container.

```bash
docker run --name openfuyao_pd -it -d --net=host --shm-size=500g \
    --privileged=true \
    -w /home \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    --entrypoint=bash \
    -v /mnt:/mnt \
    -v /data:/data \
    -v /dev:/dev \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /home:/home \
    -v /tmp:/tmp \
    -v /opt:/opt \
    -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \
    -v XXX \
    bc6711a1ef08  # Use the image ID obtained in step 2.2 here
```

3. Git Clone
3.1 Execute the following command to enter the container.

```bash
# Replace openfuyao_pd with the actual container ID or name; use docker ps to view container information
docker exec -it openfuyao_pd bash
```

3.2 Execute the following commands to configure the proxy and clone the repositories.

```bash
# Network configuration
export http_proxy=http://[account]:[password]@proxyhk.huawei.com:8080
export https_proxy=http://[account]:[password]@proxyhk.huawei.com:8080
export no_proxy=127.0.0.1,localhost,local,.local
git config --global http.sslVerify false
# Download Mooncake
git clone https://github.com/AscendTransport/Mooncake
# Download vLLM
git clone https://github.com/vllm-project/vllm
# Download vllm-ascend
git clone https://gitcode.com/openFuyao/vllm-ascend
```

4. Mooncake Installation
Open the script `Mooncake/scripts/ascend/dependencies_ascend.sh` in the cloned folder, find the following two lines that install the whl package, and comment them out to avoid a duplicate installation.

```text
pip install mooncake-wheel/dist/*.whl --force
echo -e "Mooncake wheel pipinstall successfully."
```

Then run the installation script.

```bash
cd Mooncake/scripts/ascend
# The last two lines (whl package installation) must already be commented out; a duplicate installation causes errors
bash dependencies_ascend.sh
```

5. vLLM Installation
Execute the following commands to enter the vllm directory and install vLLM.

```bash
cd vllm
VLLM_TARGET_DEVICE=empty pip install -v -e .
```

6. vllm-ascend Installation
Enter the vllm-ascend directory, change the `torch-npu` version in both `requirements.txt` and `pyproject.toml` to `torch-npu==2.7.1rc1`, then run the following installation commands.

```bash
cd vllm-ascend/
# Modify requirements.txt and pyproject.toml before installing
COMPILE_CUSTOM_KERNELS=0 pip install -v -e ./
```

You can check whether the installation succeeded with the following command.

```bash
pip list | grep vllm
```

If the installation succeeded, the output contains two lines similar to the following.

```text
vllm         0.10.2rc2.dev7+g7c8271cd1.empty
vllm_ascend  0.1.dev789+g9b418ede4.d20250902
```

7. torch-npu Installation
Download the corresponding version of the torch_npu package, `torch_npu-2.7.1.dev20250724-cp311-cp311-manylinux_2_28_aarch64.whl`, and upload it to the server. Execute the following installation command in the directory that contains the package.

```bash
pip install --no-index --no-deps --force-reinstall torch_npu**
```
Using PD Disaggregation to Launch Instances
Prerequisites
Mooncake, vLLM, and vllm-ascend have been installed in the container.
Background Information
PD Disaggregation deploys the Prefill and Decode stages in two independent clusters to achieve full resource utilization and throughput improvement.
Usage Restrictions
During launch, the ports used by the Prefill and Decode instances must not conflict with each other, and the tp_size of the Prefill instance must not be less than the tp_size of the Decode instance.
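Both restrictions can be checked before launching. The following is a minimal sketch, not part of vllm-ascend: the function name `check_pd_config` and its argument order are assumptions made for illustration. It takes the two tp_size values followed by every port the instances will use.

```bash
#!/usr/bin/env bash
# check_pd_config PREFILL_TP DECODE_TP PORT...
# Returns 0 when both restrictions hold, 1 otherwise (messages go to stderr).
check_pd_config() {
  local prefill_tp=$1 decode_tp=$2
  shift 2
  # Restriction 1: prefill tp_size must not be less than decode tp_size.
  if (( prefill_tp < decode_tp )); then
    echo "prefill tp_size ($prefill_tp) is less than decode tp_size ($decode_tp)" >&2
    return 1
  fi
  # Restriction 2: no port may appear more than once.
  local dup
  dup="$(printf '%s\n' "$@" | sort | uniq -d | tr '\n' ' ')"
  if [[ -n "$dup" ]]; then
    echo "conflicting ports: $dup" >&2
    return 1
  fi
  return 0
}
```

For example, `check_pd_config 2 2 8107 8207 20001 20002 && echo "config OK"` passes, whereas reusing a port number or passing a prefill tp_size of 1 with a decode tp_size of 2 makes the function return a non-zero status.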
Operation Steps
1. PD Disaggregation Launch

1.1 Prepare the scripts run_prefill.sh and run_decode.sh. Replace the model file path `./Qwen3-8B` with the actual file path.

run_prefill.sh

```bash
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export HCCL_IF_IP=0.0.0.0
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export ASCEND_RT_VISIBLE_DEVICES=0,1
export VLLM_USE_V1=1
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
echo "HCCL_IF_IP=$HCCL_IF_IP"

vllm serve ./Qwen3-8B \
  --host 0.0.0.0 \
  --port 8107 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --max-model-len 2000 \
  --max-num-batched-tokens 2000 \
  --trust-remote-code \
  --enforce-eager \
  --data-parallel-size 1 \
  --data-parallel-address localhost \
  --data-parallel-rpc-port 9100 \
  --gpu-memory-utilization 0.8 \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnectorV1",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_rank": 0,
    "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
    "kv_connector_extra_config": {
      "prefill": { "dp_size": 1, "tp_size": 2 },
      "decode": { "dp_size": 1, "tp_size": 2 }
    }
  }'
```

run_decode.sh
```bash
export HCCL_EXEC_TIMEOUT=204
export HCCL_CONNECT_TIMEOUT=120
export HCCL_IF_IP=0.0.0.0
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export ASCEND_RT_VISIBLE_DEVICES=2,3
export VLLM_USE_V1=1
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
echo "HCCL_IF_IP=$HCCL_IF_IP"

vllm serve ./Qwen3-8B \
  --host 0.0.0.0 \
  --port 8207 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --max-model-len 2000 \
  --max-num-batched-tokens 2000 \
  --trust-remote-code \
  --enforce-eager \
  --data-parallel-size 1 \
  --data-parallel-address localhost \
  --data-parallel-rpc-port 9100 \
  --gpu-memory-utilization 0.8 \
  --no-enable-prefix-caching \
  --kv-transfer-config \
  '{"kv_connector": "MooncakeConnectorV1",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20002",
    "engine_id": "1",
    "kv_rank": 1,
    "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
    "kv_connector_extra_config": {
      "prefill": { "dp_size": 1, "tp_size": 2 },
      "decode": { "dp_size": 1, "tp_size": 2 }
    }
  }'
```

1.2 Execute the following commands to launch the Prefill and Decode instances respectively.
```bash
bash run_prefill.sh
bash run_decode.sh
```

1.3 In a new terminal, start the proxy service. The values of `prefiller-hosts`, `prefiller-ports`, `decoder-hosts`, and `decoder-ports` must match the settings in the launch scripts.

```bash
cd vllm-ascend/examples/disaggregated_prefill_v1/
python load_balance_proxy_server_example.py --host 0.0.0.0 --prefiller-hosts 127.0.0.1 --prefiller-ports 8107 --decoder-hosts 127.0.0.1 --decoder-ports 8207 --port 9090
```

2. Service Launch Verification
Open a new terminal for testing. Replace the model file path `./Qwen3-8B` with the actual file path. The port in the request must match the port specified when starting the proxy service (9090 in this example).

```bash
curl -v -s http://127.0.0.1:9090/v1/completions -H "Content-Type: application/json" -d '{"model": "./Qwen3-8B","prompt": "你好,你是谁?","max_tokens": 256}'
```

A response that contains generated text indicates that the configuration is successful.
Follow-up Operations
To enable batch integration, add the following environment variable to the Prefill instance launch script.

```bash
export BATCH_BY_REQUEST_LENGTH="true"
```

Also add the following parameter to the `vllm serve` command in the Prefill instance launch script.

```bash
--scheduling-policy "priority"
```

To modify other configurations of the PD Disaggregation instances, refer to vllm-ascend/examples/disaggregated_prefill_v1/mooncake_connector_deployment_guide.md in the vllm-project/vllm-ascend repository on GitHub.

