Mooncake Ascend NPU Storage Management Layer
Feature Introduction
This feature adapts the CacheTier abstract base class from the new Mooncake Store architecture (Issue 954) to Ascend. It implements the AscendCacheTier class to support caching KVCache on Ascend NPU devices.
Application Scenarios
In large model inference services, use Mooncake to manage KVCache data in local Ascend device VRAM.
Capability Scope
- Supports data transfer between VRAM and DRAM, and between VRAM and VRAM (currently only the indirect method VRAM->DRAM->VRAM is registered).
- Supports VRAM layer management for Ascend NPU devices: basic operations such as Init, Allocate, and Free are supported, so memory can be allocated and released on Ascend devices. Resources are managed with the RAII pattern, and thread safety is ensured through atomic operations.
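The indirect VRAM->DRAM->VRAM path mentioned above can be pictured as two copies through a host staging buffer. The following is only a minimal sketch under assumptions: `CopyViaHost` and `CopyStep` are hypothetical names, and the two copy steps are passed in as callbacks so the example builds without the ACL headers. It is not the registered Mooncake copier.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical signature for one copy leg: (source, destination, byte count).
using CopyStep = std::function<bool(const void*, void*, std::size_t)>;

// Device-to-device copy staged through host memory.
// `d2h` stands in for a device-to-host copy and `h2d` for a host-to-device
// copy; a real implementation would back both with ACL Runtime calls.
inline bool CopyViaHost(const void* src_device, void* dst_device,
                        std::size_t size, const CopyStep& d2h,
                        const CopyStep& h2d) {
    assert(d2h && h2d);  // both legs must be provided
    std::vector<unsigned char> staging(size);  // temporary DRAM buffer
    if (!d2h(src_device, staging.data(), size)) {
        return false;  // first leg failed, destination is untouched
    }
    return h2d(staging.data(), dst_device, size);
}
```

Note that the staging buffer temporarily adds `size` bytes of host memory, which is one reason temporary copies can raise peak usage.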
Highlights
- End-to-End Ascend Adaptation: With the CacheTier abstraction as the interface, AscendCacheTier provides the Ascend implementation and calls ACL Runtime underneath for device memory allocation and copying, so the business side can use it through the unified Tier interface.
- Thread-Safe Capacity Management: The Allocate stage uses atomic CAS to reserve capacity and rolls the reservation back on failure, avoiding over-allocation and inconsistent usage statistics in concurrent scenarios.
- RAII Mechanism: AscendBuffer encapsulates device memory and automatically releases it during destruction.
Implementation Principle
Allocate Process
Figure 1 Allocate Process Diagram
Process Description:
- TieredBackend forwards the Allocate request to the target tier.
- AscendCacheTier checks initialization status.
- Uses CAS atomic operation to reserve space (check then update, retry on failure).
- Calls `AllocateDeviceMemory` to allocate device memory.
- If device allocation fails, rolls back the reserved space through `fetch_sub`.
- Creates `AscendBuffer` to encapsulate device memory, returns to caller through `DataSource`.
- `AscendBuffer` implements RAII, automatically releasing device memory during destruction.
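The reserve-then-allocate sequence above can be sketched as follows. This is an illustration under assumptions, not the AscendCacheTier source: `CapacityTracker`, `FakeAllocateDeviceMemory`, and `AllocateWithReservation` are hypothetical names, and `std::malloc` stands in for the device allocator so the example builds without ACL.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for the tier's capacity bookkeeping.
class CapacityTracker {
public:
    explicit CapacityTracker(std::size_t capacity) : capacity_(capacity) {}

    // Reserve `size` bytes with a CAS loop. On a failed exchange `used` is
    // refreshed with the current value, so the limit is re-checked each retry.
    bool TryReserve(std::size_t size) {
        std::size_t used = current_usage_.load(std::memory_order_relaxed);
        do {
            if (size > capacity_ - used) {
                return false;  // would exceed the hard limit
            }
        } while (!current_usage_.compare_exchange_weak(
            used, used + size, std::memory_order_acq_rel,
            std::memory_order_relaxed));
        return true;
    }

    // Undo a reservation (failed device allocation) or account for a free.
    void Release(std::size_t size) {
        std::size_t previous =
            current_usage_.fetch_sub(size, std::memory_order_acq_rel);
        assert(previous >= size);  // debug check against underflow
        (void)previous;
    }

    std::size_t usage() const { return current_usage_.load(); }

private:
    const std::size_t capacity_;
    std::atomic<std::size_t> current_usage_{0};
};

// Placeholder for the device allocator. `fail` simulates a device-side
// allocation failure; a real tier would call into ACL Runtime here.
inline void* FakeAllocateDeviceMemory(std::size_t size, bool fail) {
    return fail ? nullptr : std::malloc(size);
}

// Reserve first, allocate second, roll back if the allocation fails.
inline void* AllocateWithReservation(CapacityTracker& tracker,
                                     std::size_t size,
                                     bool simulate_device_failure = false) {
    if (!tracker.TryReserve(size)) {
        return nullptr;
    }
    void* ptr = FakeAllocateDeviceMemory(size, simulate_device_failure);
    if (ptr == nullptr) {
        tracker.Release(size);  // rollback keeps the counter consistent
    }
    return ptr;
}
```

Reserving before allocating means that concurrent callers can never jointly exceed the capacity, at the cost of a short window in which the counter includes memory that has not been allocated yet.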
Free Process
Figure 2 Free Process Diagram
Process Description:
- TieredBackend forwards the Free request to the target tier.
- Checks if buffer is empty (empty buffer is a safe no-op).
- Gets the size of memory to be released.
- When `DataSource` goes out of scope, `AscendBuffer` destructor is called.
- `ReleaseMemory()` calls `aclrtFree` to release device memory.
- Updates `current_usage_` through atomic operation.
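A minimal sketch of the Free path, assuming a move-only owner type: `DeviceBuffer`, `DeviceFreeFn`, and `FreeBuffer` are hypothetical names, and the release function is injected in place of `aclrtFree` so the example does not depend on ACL. The actual `AscendBuffer` and `DataSource` types may look different.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical release hook standing in for the device free call.
using DeviceFreeFn = void (*)(void*);

// Move-only owner of one device allocation; releases it exactly once.
class DeviceBuffer {
public:
    DeviceBuffer(void* ptr, std::size_t size, DeviceFreeFn free_fn)
        : ptr_(ptr), size_(size), free_fn_(free_fn) {
        assert(free_fn_ != nullptr);
    }
    ~DeviceBuffer() { ReleaseMemory(); }

    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
    DeviceBuffer(DeviceBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          size_(std::exchange(other.size_, 0)),
          free_fn_(other.free_fn_) {}
    DeviceBuffer& operator=(DeviceBuffer&&) = delete;

    std::size_t size() const { return size_; }

private:
    void ReleaseMemory() {
        if (ptr_ != nullptr) {
            free_fn_(ptr_);
            ptr_ = nullptr;  // a moved-from or released buffer frees nothing
        }
    }

    void* ptr_;
    std::size_t size_;
    DeviceFreeFn free_fn_;
};

// Free path: an empty handle is a no-op. Otherwise read the size, drop the
// buffer (its destructor releases the memory), then shrink the usage counter.
inline void FreeBuffer(std::unique_ptr<DeviceBuffer>& buffer,
                       std::atomic<std::size_t>& current_usage) {
    if (!buffer) {
        return;
    }
    const std::size_t size = buffer->size();
    buffer.reset();
    current_usage.fetch_sub(size, std::memory_order_acq_rel);
}
```

The size is read before the buffer is destroyed because it is no longer available afterwards.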
CopyDramToAscend Process
Figure 3 CopyDramToAscend Process Diagram
Process Description:
- `DataCopier` finds the corresponding copy function based on source/destination `MemoryType`.
- `CopyDramToAscend` function validates parameter validity.
- Ensures target buffer is `AscendBuffer` type through `dynamic_cast`.
- Gets device ID and device pointer from `AscendUnifiedPointer`.
- Sets device context and executes `aclrtMemcpy(HOST_TO_DEVICE)`.
- Copy function is automatically registered during static initialization phase through `CopierRegistrar`.
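The lookup, type check, and static registration described above can be sketched like this. Every type and function name in the block (`CopierRegistry`, `Registrar`, `FakeAscendBuffer`, `HostBuffer`, `CopyDramToFakeAscend`) is an illustrative stand-in rather than Mooncake's declaration, and `std::memcpy` replaces the device copy so the example runs on a host without ACL.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <map>
#include <utility>

enum class MemoryType { DRAM, ASCEND_NPU };

struct BufferBase {
    virtual ~BufferBase() = default;
    virtual void* data() = 0;
    virtual std::size_t size() const = 0;
};

// Plays the role of the Ascend buffer; backed by host memory in this sketch.
struct FakeAscendBuffer : BufferBase {
    FakeAscendBuffer(void* p, std::size_t n, int dev)
        : ptr(p), bytes(n), device_id(dev) {}
    void* data() override { return ptr; }
    std::size_t size() const override { return bytes; }
    void* ptr;
    std::size_t bytes;
    int device_id;
};

// A buffer of a different concrete type, used to show the type check failing.
struct HostBuffer : BufferBase {
    HostBuffer(void* p, std::size_t n) : ptr(p), bytes(n) {}
    void* data() override { return ptr; }
    std::size_t size() const override { return bytes; }
    void* ptr;
    std::size_t bytes;
};

using CopyFn = std::function<bool(const void*, std::size_t, BufferBase&)>;

// Maps a (source, destination) memory type pair to a copy function.
class CopierRegistry {
public:
    static CopierRegistry& Instance() {
        static CopierRegistry registry;  // constructed on first use
        return registry;
    }
    void Register(MemoryType src, MemoryType dst, CopyFn fn) {
        table_[std::make_pair(src, dst)] = std::move(fn);
    }
    const CopyFn* Find(MemoryType src, MemoryType dst) const {
        auto it = table_.find(std::make_pair(src, dst));
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::pair<MemoryType, MemoryType>, CopyFn> table_;
};

// A namespace-scope object of this type registers a copier during static
// initialization, before main() runs.
struct Registrar {
    Registrar(MemoryType src, MemoryType dst, CopyFn fn) {
        CopierRegistry::Instance().Register(src, dst, std::move(fn));
    }
};

inline bool CopyDramToFakeAscend(const void* src, std::size_t size,
                                 BufferBase& dst) {
    if (src == nullptr || size == 0) {
        return false;  // parameter validation
    }
    auto* target = dynamic_cast<FakeAscendBuffer*>(&dst);  // type check
    if (target == nullptr || target->size() < size) {
        return false;
    }
    // A real implementation would select target->device_id and issue a
    // host-to-device copy through ACL Runtime at this point.
    std::memcpy(target->data(), src, size);
    return true;
}

static const Registrar kDramToAscendRegistrar(
    MemoryType::DRAM, MemoryType::ASCEND_NPU, CopyDramToFakeAscend);
```

Because the registry is a function-local static, it is safe to use from the registrar's constructor regardless of the order in which translation units are initialized.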
Relationship with Related Features
Depends on the Tiered Backend feature to manage and initialize all CacheTier instances:
- Provides a global tier view, including used memory, priority, tags, and mappings between keys for each tier.
- Implements higher-level data operation APIs based on CacheTier's API, such as data migration between two tier layers.
Using AscendCacheTier
Prerequisites
- Hardware Requirements: Requires Ascend NPU devices.
- Software Requirements: Requires ACL Runtime library support.
- Compilation Requirements: CMake compilation requires adding the option `-DUSE_ASCEND_CACHE_TIER=ON` to enable AscendCacheTier functionality.
- Environment Requirements: mooncake-master service needs to be started first.
Background Information
- Usage Scenario: In Ascend NPU inference scenarios, use Mooncake to manage KVCache data in local Ascend device VRAM.
- Basic Principle: AscendCacheTier, as the Ascend implementation of CacheTier, is responsible for Init/Allocate/Free of device memory; capacity is reserved through atomic CAS to ensure thread safety; underlying calls to ACL Runtime (such as `aclrtMemcpy`) complete HOST/DEVICE or DEVICE/DEVICE transfers.
- Important Considerations:
  - Device Context: Device memory operations and copies set the device context for device_id (for example through `aclrtSetDevice`). In multi-card scenarios, make sure the configured device_id matches the device the process is actually bound to.
  - Lifecycle and RAII: Device memory is encapsulated by AscendBuffer and released during destruction; users should avoid accessing buffer pointers after DataSource is released.
  - Capacity Assessment: capacity is a hard limit. Unexpected temporary buffers or copies may raise peak usage, so monitoring current_usage_ is recommended for capacity planning.
Usage Restrictions
- Scope of Application: Only available when compiled with `USE_ASCEND_CACHE_TIER=ON` enabled and the runtime environment has ACL Runtime.
Operation Steps
AscendCacheTier is configured and initialized through TieredBackend, using a configuration described in JSON format. Both a single cache layer and a joint multi-layer configuration are supported. The configuration process mainly includes the following steps:
Prepare JSON configuration content. Example (single-layer NPU configuration):

```jsonc
{
  "tiers": [
    {
      "type": "ASCEND_NPU",
      "capacity": 536870912,    // 512 MiB
      "priority": 100,          // Priority, higher value means higher priority
      "device_id": 0,           // Ascend device ID
      "tags": ["npu", "fast"]   // Optional tags
    }
  ]
}
```

Pass JSON configuration when deploying RealClient.
RealClient deployment is divided into two methods: separated deployment and integrated deployment.
Separated Deployment
In separated deployment scenarios, RealClient is started from the compiled binary mooncake_client. The JSON configuration can be passed directly through the startup parameter tiered_backend_config.
Startup parameter example:
```bash
./mooncake_client --master_server_address 127.0.0.1:50051 \
  --port 50052 \
  --client_rpc_port 12345 \
  --deployment_mode P2P \
  --tiered_backend_config '{"tiers":[{"type":"DRAM","capacity":536870912,"priority":100,"allocator_type":"OFFSET"}]}' \
  --metadata_server P2PHANDSHAKE
```

You can also set tiered_backend_config to empty and save JSON in the environment variable `MOONCAKE_TIRED_CONFIG`.
Integrated Deployment
In integrated deployment scenarios, RealClient is started from a script that calls setup_p2p_real_client. The JSON configuration is passed as the input parameter tiered_backend_config_json to initialize AscendCacheTier.
Script example:
```python
from mooncake.store import MooncakeDistributedStore
import json

store = MooncakeDistributedStore()

# Two tiers: a small DRAM tier and a higher-priority Ascend NPU tier.
tiered_backend_config = {
    "tiers": [
        {
            "type": "DRAM",
            "capacity": 64 * 1024 * 1024,
            "priority": 10
        },
        {
            "type": "ASCEND_NPU",
            "capacity": 512 * 1024 * 1024,
            "priority": 100,
            "device_id": 0,
            "tags": ["npu", "fast"]
        }
    ]
}

ret = store.setup_p2p_real_client(
    local_hostname="10.10.10.21",
    metadata_server="P2PHANDSHAKE",
    tiered_backend_config_json=json.dumps(tiered_backend_config),
    local_buffer_size=16 * 1024 * 1024,
    protocol="tcp",
    rdma_devices="",
    master_server_addr="127.0.0.1:50051",
    client_rpc_port=12345,
    client_rpc_thread_num=16,
)
print("setup ret =", ret)
```
Follow-up Operations
For follow-up operations, refer to Mooncake-Store Documentation.


