Version: v26.03

Mooncake Ascend NPU Storage Management Layer

Feature Introduction

This feature adapts the CacheTier abstract base class from the new Mooncake Store architecture (Issue 954) to Ascend. It implements the AscendCacheTier class, which supports caching KVCache on Ascend NPU devices.

Application Scenarios

In large model inference services, use Mooncake to manage KVCache data in local Ascend device VRAM.

Capability Scope

  • Supports data transfer between VRAM and DRAM, and between VRAM and VRAM (currently only the indirect VRAM->DRAM->VRAM path is registered).
  • Supports VRAM tier management for Ascend NPU devices: the basic operations Init, Allocate, and Free are implemented, so memory can be allocated and released on Ascend devices. Resources are managed with the RAII pattern, and thread safety is ensured through atomic operations.

Highlights

  • End-to-End Ascend Adaptation: AscendCacheTier implements the CacheTier abstraction and calls ACL Runtime underneath for device memory allocation and copying, so callers can use it through the unified tier interface.
  • Thread-Safe Capacity Management: Allocate reserves capacity with an atomic CAS and rolls the reservation back on failure, which avoids over-allocation and inconsistent usage statistics under concurrency.
  • RAII Mechanism: AscendBuffer encapsulates device memory and releases it automatically on destruction.

Implementation Principle

Allocate Process

Figure 1 Allocate Process Diagram

Process Description:

  1. TieredBackend forwards the Allocate request to the target tier.
  2. AscendCacheTier checks its initialization status.
  3. It reserves space with an atomic CAS operation (check, then update; retry on failure).
  4. It calls AllocateDeviceMemory to allocate device memory.
  5. If the device allocation fails, it rolls back the reserved space with fetch_sub.
  6. It creates an AscendBuffer that encapsulates the device memory and returns it to the caller through a DataSource.
  7. AscendBuffer follows RAII and releases the device memory automatically on destruction.
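
Steps 3 to 5 can be sketched in C++ roughly as follows. This is a minimal illustration under stated assumptions, not the AscendCacheTier source: CapacityTracker, AllocateSketch, HostAlloc, and FailingAlloc are hypothetical names, and std::malloc stands in for the real device allocation so that the example runs without ACL Runtime.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical capacity tracker; names are illustrative, not Mooncake's API.
class CapacityTracker {
public:
    explicit CapacityTracker(std::size_t capacity) : capacity_(capacity) {}

    // Step 3: reserve `size` bytes with a CAS loop. Returns false if the
    // reservation would push usage past the hard capacity limit.
    bool TryReserve(std::size_t size) {
        std::size_t used = usage_.load();
        do {
            if (size > capacity_ - used) return false;
            // On failure compare_exchange_weak stores the latest value in
            // `used`, so the capacity check is repeated before retrying.
        } while (!usage_.compare_exchange_weak(used, used + size));
        return true;
    }

    // Give back a reservation (used for rollback and for a later Free).
    void Release(std::size_t size) {
        const std::size_t prev = usage_.fetch_sub(size);
        assert(prev >= size);  // releasing more than was reserved is a bug
        (void)prev;
    }

    std::size_t usage() const { return usage_.load(); }

private:
    const std::size_t capacity_;
    std::atomic<std::size_t> usage_{0};
};

// Allocation callback, so that a failing device allocation can be simulated.
using AllocFn = void* (*)(std::size_t);

// Assumption: host memory stands in for device memory in this sketch.
void* HostAlloc(std::size_t size) { return std::malloc(size); }
void* FailingAlloc(std::size_t) { return nullptr; }

void* AllocateSketch(CapacityTracker& tracker, std::size_t size, AllocFn alloc) {
    if (size == 0 || !tracker.TryReserve(size)) return nullptr;
    void* ptr = alloc(size);                    // step 4: allocate memory
    if (ptr == nullptr) tracker.Release(size);  // step 5: roll back on failure
    return ptr;
}
```

Reserving before allocating means that concurrent callers can never jointly exceed the capacity, and the rollback keeps the usage counter accurate when the allocation itself fails.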

Free Process

Figure 2 Free Process Diagram

Process Description:

  1. TieredBackend forwards the Free request to the target tier.
  2. The tier checks whether the buffer is empty (freeing an empty buffer is a safe no-op).
  3. It reads the size of the memory to be released.
  4. When the DataSource goes out of scope, the AscendBuffer destructor is called.
  5. ReleaseMemory() calls aclrtFree to release the device memory.
  6. current_usage_ is updated through an atomic operation.
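
The Free steps above can be sketched as follows. SketchBuffer, SketchDataSource, and FreeSketch are hypothetical stand-ins for AscendBuffer, DataSource, and the tier's Free method, and std::free replaces aclrtFree so the example runs on a host without ACL Runtime. The sketch also assumes that the usage update is a decrement by the buffer size.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>

// Hypothetical RAII wrapper modelled on the description of AscendBuffer.
class SketchBuffer {
public:
    SketchBuffer(void* ptr, std::size_t size) : ptr_(ptr), size_(size) {}
    ~SketchBuffer() { ReleaseMemory(); }

    // Non-copyable: exactly one owner may free the memory.
    SketchBuffer(const SketchBuffer&) = delete;
    SketchBuffer& operator=(const SketchBuffer&) = delete;

    std::size_t size() const { return size_; }

private:
    void ReleaseMemory() {
        std::free(ptr_);  // the real class is described as calling aclrtFree
        ptr_ = nullptr;
    }

    void* ptr_;
    std::size_t size_;
};

// Hypothetical stand-in for DataSource: it owns the buffer.
struct SketchDataSource {
    std::unique_ptr<SketchBuffer> buffer;
};

// Takes ownership of the data source, frees it, and updates the usage counter.
void FreeSketch(SketchDataSource source, std::atomic<std::size_t>& current_usage) {
    if (!source.buffer) return;                       // step 2: empty is a no-op
    const std::size_t size = source.buffer->size();   // step 3: size to release
    source.buffer.reset();                            // steps 4-5: destructor frees
    const std::size_t prev = current_usage.fetch_sub(size);  // step 6
    assert(prev >= size);  // usage must cover what is being released
    (void)prev;
}
```

The size is read before the buffer is destroyed because it is no longer available afterwards.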

CopyDramToAscend Process

Figure 3 CopyDramToAscend Process Diagram

Process Description:

  1. DataCopier looks up the copy function that matches the source and destination MemoryType.
  2. CopyDramToAscend validates its parameters.
  3. It verifies through dynamic_cast that the target buffer is an AscendBuffer.
  4. It obtains the device ID and device pointer from the AscendUnifiedPointer.
  5. It sets the device context and executes aclrtMemcpy (HOST_TO_DEVICE).
  6. The copy function itself is registered automatically by CopierRegistrar during static initialization, which is what makes it discoverable in step 1.
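
The lookup and self-registration pattern in steps 1 and 6 can be illustrated with the sketch below. Only the names MemoryType, CopierRegistrar, and CopyDramToAscend come from the text above; their shapes here, together with CopierRegistry, CopyFn, and CopyDramToAscendSketch, are assumptions made for illustration. std::memcpy on host memory stands in for the dynamic_cast check, the device context switch, and aclrtMemcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <map>
#include <utility>

enum class MemoryType { DRAM, ASCEND_NPU };

using CopyFn = std::function<bool(const void* src, void* dst, std::size_t size)>;

// Hypothetical registry keyed by (source type, destination type).
class CopierRegistry {
public:
    // A function-local static is constructed on first use, so registrars in
    // other translation units can rely on it during static initialization.
    static CopierRegistry& Instance() {
        static CopierRegistry registry;
        return registry;
    }

    void Register(MemoryType src, MemoryType dst, CopyFn fn) {
        assert(fn && "copy function must be callable");
        table_[std::make_pair(src, dst)] = std::move(fn);
    }

    // Returns nullptr when no copy function is registered for the pair.
    const CopyFn* Find(MemoryType src, MemoryType dst) const {
        auto it = table_.find(std::make_pair(src, dst));
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::pair<MemoryType, MemoryType>, CopyFn> table_;
};

// Constructing a registrar object registers the function as a side effect.
struct CopierRegistrar {
    CopierRegistrar(MemoryType src, MemoryType dst, CopyFn fn) {
        CopierRegistry::Instance().Register(src, dst, std::move(fn));
    }
};

// Stand-in copy: validates its arguments, then copies host memory.
bool CopyDramToAscendSketch(const void* src, void* dst, std::size_t size) {
    if (src == nullptr || dst == nullptr || size == 0) return false;
    std::memcpy(dst, src, size);
    return true;
}

// Static object: its constructor runs before main() and performs registration.
static CopierRegistrar g_dram_to_ascend(MemoryType::DRAM, MemoryType::ASCEND_NPU,
                                        CopyDramToAscendSketch);
```

A lookup for a pair that was never registered returns nullptr, which lets the caller fall back to a multi-step route (for example through DRAM) or report an error.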

AscendCacheTier depends on the Tiered Backend feature, which manages and initializes all CacheTier instances. The Tiered Backend:

  • Provides a global view of the tiers, including the used memory, priority, tags, and key mappings of each tier.
  • Implements higher-level data operation APIs on top of the CacheTier API, such as data migration between two tiers.
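
As an illustration of what a global tier view makes possible, the sketch below picks a tier for a new allocation. Both the TierView struct and the selection policy (highest priority among tiers with enough free space) are assumptions made for this example. The text above does not describe how TieredBackend actually chooses a tier.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-tier summary mirroring the fields listed above.
struct TierView {
    std::string type;
    std::size_t capacity;  // hard limit in bytes
    std::size_t used;      // bytes currently in use
    int priority;          // higher value means higher priority
    std::vector<std::string> tags;
};

// Assumed policy: among tiers with at least `size` free bytes, return the one
// with the highest priority. Returns nullptr if no tier has enough room.
const TierView* PickTier(const std::vector<TierView>& tiers, std::size_t size) {
    const TierView* best = nullptr;
    for (const TierView& tier : tiers) {
        assert(tier.used <= tier.capacity);
        if (tier.capacity - tier.used < size) continue;  // not enough room
        if (best == nullptr || tier.priority > best->priority) best = &tier;
    }
    return best;
}
```

With a DRAM tier at priority 10 and an ASCEND_NPU tier at priority 100, this policy prefers the NPU tier until it runs out of space and then falls back to DRAM.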

Using AscendCacheTier

Prerequisites

  • Hardware Requirements: Requires Ascend NPU devices.
  • Software Requirements: Requires ACL Runtime library support.
  • Compilation Requirements: Pass the option -DUSE_ASCEND_CACHE_TIER=ON to CMake to enable AscendCacheTier.
  • Environment Requirements: The mooncake-master service must be started first.

Background Information

  • Usage Scenario: In Ascend NPU inference scenarios, use Mooncake to manage KVCache data in local Ascend device VRAM.
  • Basic Principle: AscendCacheTier is the Ascend implementation of CacheTier and handles Init, Allocate, and Free for device memory. Capacity is reserved with an atomic CAS to ensure thread safety. ACL Runtime calls (such as aclrtMemcpy) perform the HOST/DEVICE or DEVICE/DEVICE transfers.
  • Important Considerations:
    • Device Context: Device memory operations and copies set the device by device_id (for example through aclrtSetDevice). In multi-card scenarios, make sure the configured device_id matches the device the process is actually bound to.
    • Lifecycle and RAII: Device memory is encapsulated by AscendBuffer and released on destruction. Do not access buffer pointers after the DataSource has been released.
    • Capacity Assessment: capacity is a hard limit. Unexpected temporary buffers or copies can amplify peak usage, so monitoring current_usage_ is recommended for capacity planning.

Usage Restrictions

  • Scope of Application: Available only when Mooncake is compiled with USE_ASCEND_CACHE_TIER=ON and ACL Runtime is present in the runtime environment.

Operation Steps

AscendCacheTier is configured and initialized through TieredBackend, using a configuration described in JSON. The configuration can define a single cache tier or several tiers together. The process consists of the following steps:

  1. Prepare the JSON configuration. Example (single-tier NPU configuration):

    jsonc
    {
        "tiers": [
            {
                "type": "ASCEND_NPU",
                "capacity": 536870912,      // 512 MiB (512 * 1024 * 1024 bytes)
                "priority": 100,            // Higher value means higher priority
                "device_id": 0,             // Ascend device ID
                "tags": ["npu", "fast"]     // Optional tags
            }
        ]
    }

  2. Pass the JSON configuration when deploying RealClient.

    RealClient can be deployed in two ways: separated deployment and integrated deployment.

    • Separated Deployment

      In separated deployment, RealClient is started from the compiled binary mooncake_client. The JSON configuration can be passed directly through the startup parameter tiered_backend_config.

      Startup parameter example:

      bash
      ./mooncake_client --master_server_address 127.0.0.1:50051 \
      --port 50052 \
      --client_rpc_port 12345 \
      --deployment_mode P2P \
      --tiered_backend_config '{"tiers":[{"type":"DRAM","capacity":536870912,"priority":100,"allocator_type":"OFFSET"}]}' \
      --metadata_server P2PHANDSHAKE

      Alternatively, leave tiered_backend_config empty and store the JSON in the environment variable MOONCAKE_TIRED_CONFIG.

    • Integrated Deployment

      In integrated deployment, RealClient is started from a script that calls setup_p2p_real_client. Pass the JSON configuration in the input parameter tiered_backend_config_json to initialize AscendCacheTier.

      Script example:

      python
      from mooncake.store import MooncakeDistributedStore
      import json
      
      store = MooncakeDistributedStore()
      
      tiered_backend_config = {
          "tiers": [
              {
                  "type": "DRAM",
                  "capacity": 64 * 1024 * 1024,
                  "priority": 10
              },
              {
                  "type": "ASCEND_NPU",
                  "capacity": 512 * 1024 * 1024,
                  "priority": 100,
                  "device_id": 0,
                  "tags": ["npu", "fast"]
              }
          ]
      }
      
      ret = store.setup_p2p_real_client(
          local_hostname="10.10.10.21",
          metadata_server="P2PHANDSHAKE",
          tiered_backend_config_json=json.dumps(tiered_backend_config),
          local_buffer_size=16 * 1024 * 1024,
          protocol="tcp",
          rdma_devices="",
          master_server_addr="127.0.0.1:50051",
          client_rpc_port=12345,
          client_rpc_thread_num=16,
      )
      
      print("setup ret =", ret)

Follow-up Operations

For follow-up operations, refer to the Mooncake-Store documentation.