Version: v26.03

Mooncake P2P Architecture Client Remote Invocation

Feature Introduction

This feature implements cross-machine data read/write between Clients for the Mooncake Store V3 architecture (ISSUE 1209). Metadata is exchanged over RPC, and the data itself is moved by TransferEngine. Core components include:

  • ClientRpcService: RPC server, handling remote data read/write requests.
  • DataManager: Data management layer, coordinating local/remote data operations.
  • PeerClient: RPC client, initiating asynchronous cross-machine data read/write requests.

Application Scenarios

Large-model inference services that use the Mooncake V3 architecture to read and write remote data across machines.

Capability Scope

  • Implement cross-machine data read/write RPC between Clients: Supports secure cross-machine data read/write and correctly handles RPC timeouts (timeout period 10s). Ensures concurrent access safety through key-level read/write locks (Lock Striping). Adopts RAII pattern to guarantee operation atomicity.
  • Support multiple storage media: Supports DRAM and non-DRAM (such as SSD) Tiers, with non-DRAM data transferred through temporary buffers.
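
The key-level locking mentioned above can be illustrated with a minimal Python sketch of lock striping. It is an illustration of the technique only, not the DataManager implementation: the class name `StripedLocks`, the use of CRC32 as the hash, and the use of plain exclusive locks instead of read/write locks are all assumptions made for brevity.

```python
import threading
import zlib
from contextlib import contextmanager


class StripedLocks:
    """A fixed pool of locks; every key maps to one shard by hash (hypothetical helper)."""

    def __init__(self, shard_count: int = 1024):
        if shard_count <= 0:
            raise ValueError("shard_count must be positive")
        self._locks = [threading.Lock() for _ in range(shard_count)]

    def shard_of(self, key: str) -> int:
        # crc32 is stable across runs, unlike Python's salted built-in hash().
        return zlib.crc32(key.encode("utf-8")) % len(self._locks)

    @contextmanager
    def locked(self, key: str):
        # The lock is released when the block exits, even if the body raises.
        with self._locks[self.shard_of(key)]:
            yield


locks = StripedLocks(shard_count=8)
counters = {"a": 0, "b": 0}


def bump(key: str, times: int) -> None:
    for _ in range(times):
        with locks.locked(key):
            counters[key] += 1


threads = [threading.Thread(target=bump, args=(k, 1000)) for k in ("a", "b", "a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Operations on keys that hash to different shards do not block each other, while the memory cost stays bounded by the shard count rather than the number of keys.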

Key Features

  • Control plane/data plane decoupling: RPC is responsible only for metadata and scheduling; data is transmitted across machines through TransferEngine, which reduces bandwidth pressure on RPC links.
  • Unified entry and consistent concurrency control: Local and remote operations are both handled by DataManager, which uses key-level read/write locks (Lock Striping) to ensure concurrent safety.
  • Multi-media transparency and robust failure handling: Supports DRAM and non-DRAM Tiers, with non-DRAM data staged through temporary DRAM buffers; RPC timeouts and object-not-found errors are returned to the caller correctly.

Implementation Principle

Component View

Figure 1 Component View

Cross-machine Read Operation Flow

Figure 2 Cross-machine Read Operation Flow Diagram

Flow Description:

  1. Client A initiates async ReadRemoteData RPC request through PeerClient.
  2. Client B's ClientRpcService receives the request and forwards it to DataManager.
  3. DataManager obtains data handle from TieredBackend.
  4. If data is not in DRAM, it needs to be copied to temporary buffer first.
  5. Write data to Client A's target buffer via TransferEngine using RDMA WRITE.
  6. Return success result.
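
The read steps above can be sketched on the serving side (Client B) as follows. This is a single-process simulation under stated assumptions: `TieredBackendStub`, `Handle`, `Status`, `transfer_write` and `read_remote_data` are hypothetical names, and the RDMA WRITE is replaced by an in-memory copy.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = 0
    OBJECT_NOT_FOUND = 1


@dataclass
class Handle:
    tier: str    # "DRAM" or a non-DRAM tier such as "SSD"
    data: bytes


class TieredBackendStub:
    """In-memory stand-in for TieredBackend (assumption for this sketch)."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, tier: str, data: bytes) -> None:
        self._objects[key] = Handle(tier, data)

    def get(self, key: str):
        return self._objects.get(key)


def transfer_write(src: bytes, dst: bytearray) -> None:
    """Stand-in for the TransferEngine RDMA WRITE into the requester's buffer."""
    if len(dst) < len(src):
        raise ValueError("target buffer is too small")
    dst[:len(src)] = src


def read_remote_data(backend: TieredBackendStub, key: str, target: bytearray) -> Status:
    handle = backend.get(key)              # step 3: obtain the data handle
    if handle is None:
        return Status.OBJECT_NOT_FOUND
    if handle.tier != "DRAM":
        source = bytes(handle.data)        # step 4: stage in a temporary buffer
    else:
        source = handle.data
    transfer_write(source, target)         # step 5: write into Client A's buffer
    return Status.OK                       # step 6


backend = TieredBackendStub()
backend.put("kv:1", "SSD", b"hello")
buf = bytearray(5)                         # plays the role of Client A's target buffer
status = read_remote_data(backend, "kv:1", buf)
```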

Cross-machine Write Operation Flow

Figure 3 Cross-machine Write Operation Flow Diagram

Flow Description:

  1. Client A initiates async WriteRemoteData RPC request through PeerClient.
  2. Client B's ClientRpcService receives the request and forwards it to DataManager.
  3. DataManager acquires the Key's lock in exclusive (write) mode.
  4. Allocate storage space from TieredBackend, obtain handle.
  5. If target is not DRAM, temporary buffer needs to be used.
  6. Read data from Client A via TransferEngine using RDMA READ.
  7. If target is not DRAM, copy data from temporary buffer to target storage.
  8. Call TieredBackend's Commit to submit metadata update.
  9. Return success result.

RAII Guarantee: If transfer or commit fails, the handle will automatically release allocated resources during destruction.
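
The release-on-failure behaviour can be sketched in Python with a context manager, which is the closest analogue to a C++ RAII handle. All names here (`StorageStub`, `AllocationHandle`, `write_remote_data`, `fetch`) are hypothetical, the RDMA READ is replaced by a callable, and key locking is left out to keep the sketch short.

```python
class StorageStub:
    """In-memory stand-in for the local storage tier (assumption for this sketch)."""

    def __init__(self):
        self.allocated = set()   # keys with reserved space
        self.committed = {}      # key -> bytes, visible to readers

    def allocate(self, key):
        self.allocated.add(key)

    def release(self, key):
        self.allocated.discard(key)

    def commit(self, key, data):
        self.committed[key] = data


class AllocationHandle:
    """Reserves space on creation; gives it back on exit unless commit() succeeded."""

    def __init__(self, storage, key):
        self._storage = storage
        self._key = key
        self._committed = False
        storage.allocate(key)

    def commit(self, data):
        self._storage.commit(self._key, data)
        self._committed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if not self._committed:
            self._storage.release(self._key)
        return False  # never swallow the original exception


def write_remote_data(storage, key, fetch):
    with AllocationHandle(storage, key) as handle:   # allocate space
        data = fetch()                               # stand-in for RDMA READ; may raise
        handle.commit(data)                          # publish the metadata update


def failing_fetch():
    raise ConnectionError("transfer failed")


storage = StorageStub()
try:
    write_remote_data(storage, "kv:1", failing_fetch)
except ConnectionError:
    pass  # the reserved space for "kv:1" has already been released

write_remote_data(storage, "kv:2", lambda: b"abc")
```

After the failed write, nothing for `kv:1` remains allocated or committed; after the successful one, `kv:2` is both allocated and visible.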

  • Depends on TieredBackend feature to manage local storage.
  • Depends on TransferEngine feature to implement cross-node data transmission.

Use P2P Architecture Client Remote Invocation

Prerequisites

  • Hardware Requirements: A network card that supports RDMA or TCP is required.
  • Environment Requirements: mooncake-master service must be started first.

Background Information

  • Usage Scenario: In P2P deployment mode, inference processes need to read and write remote KVCache or intermediate results across machines. Callers want to read/write remote data by Key as if accessing it locally.
  • Basic Principle: Cross-machine read/write is split into two parts. Metadata and control information are exchanged asynchronously via RPC between PeerClient and ClientRpcService; the actual data is moved across machines by TransferEngine (for example, RDMA WRITE/READ), which avoids passing large payloads directly through the RPC channel.
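
The asynchronous control-plane call and its timeout handling can be sketched with `asyncio`. This only illustrates the pattern: `call_with_timeout`, `fake_rpc`, and the `"OK"` / `"RPC_TIMEOUT"` result strings are hypothetical, and the 10-second default is the timeout stated in this document. Note that the reply carries only metadata (here a size), not the payload.

```python
import asyncio

RPC_TIMEOUT_SECONDS = 10.0  # RPC timeout stated in this document


async def call_with_timeout(rpc_coro, timeout: float = RPC_TIMEOUT_SECONDS):
    """Await an RPC coroutine and map a timeout to an error result."""
    try:
        reply = await asyncio.wait_for(rpc_coro, timeout)
        return "OK", reply
    except asyncio.TimeoutError:
        return "RPC_TIMEOUT", None


async def fake_rpc(delay: float, reply: dict) -> dict:
    """Stand-in for a remote call that answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return reply


# A prompt peer answers within the limit; a slow peer trips the (shortened) timeout.
fast = asyncio.run(call_with_timeout(fake_rpc(0.0, {"size": 4096})))
slow = asyncio.run(call_with_timeout(fake_rpc(0.5, {"size": 4096}), timeout=0.05))
```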

Usage Limitations

  • Deployment and Connectivity Limitations: Only applicable to P2P deployment mode; cluster network must meet bidirectional connectivity and port opening requirements for TCP or RDMA channels.

Operation Steps

Configure the DataManager lock shard count through the environment variable MOONCAKE_DM_LOCK_SHARD_COUNT (defaults to 1024 shards if not set). Configure the RPC transport protocol as "rdma" or "tcp" through the environment variable MC_RPC_PROTOCOL (defaults to "tcp" if not set).
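
As an illustration of how these two variables and their defaults resolve, here is a minimal Python sketch. The function name `load_settings` is hypothetical, and rejecting invalid values with `ValueError` is an assumption; this document does not state how Mooncake itself treats invalid values.

```python
import os

DEFAULT_LOCK_SHARD_COUNT = 1024
DEFAULT_RPC_PROTOCOL = "tcp"
ALLOWED_PROTOCOLS = ("tcp", "rdma")


def load_settings(env=None):
    """Resolve (shard_count, protocol) from the environment, applying defaults."""
    env = os.environ if env is None else env

    raw_count = env.get("MOONCAKE_DM_LOCK_SHARD_COUNT")
    shard_count = DEFAULT_LOCK_SHARD_COUNT if raw_count is None else int(raw_count)
    if shard_count <= 0:
        raise ValueError("MOONCAKE_DM_LOCK_SHARD_COUNT must be a positive integer")

    protocol = env.get("MC_RPC_PROTOCOL", DEFAULT_RPC_PROTOCOL)
    if protocol not in ALLOWED_PROTOCOLS:
        raise ValueError("MC_RPC_PROTOCOL must be 'tcp' or 'rdma'")

    return shard_count, protocol


defaults = load_settings({})
tuned = load_settings({"MOONCAKE_DM_LOCK_SHARD_COUNT": "256", "MC_RPC_PROTOCOL": "rdma"})
```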

When deploying Master, configure startup parameters:

  • deployment_mode: set to P2P.

When deploying Client, configure startup parameters:

  • deployment_mode: set to P2P.
  • client_rpc_port is the RPC server port number, default value is 12345.
  • rpc_thread_num is the number of RPC threads, default value is 2.

Startup example:

```bash
./mooncake_master --deployment_mode P2P --rpc_port 50051

./mooncake_client \
  --master_server_address 127.0.0.1:50051 \
  --port 50052 --client_rpc_port 12345 \
  --deployment_mode P2P \
  --tiered_backend_config '{"tiers":[{"type":"DRAM","capacity":536870912,"priority":100,"allocator_type":"OFFSET"}]}'
```
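
The capacity in the example, 536870912 bytes, is 512 MiB. When the client is launched from a script, the JSON argument can be generated rather than hand-written. The sketch below uses only the flags and values shown in the example above; the variable names are illustrative.

```python
import json
import shlex

MIB = 1024 * 1024

# Same single-tier configuration as the startup example: 512 MiB of DRAM.
tier_config = {
    "tiers": [
        {
            "type": "DRAM",
            "capacity": 512 * MIB,
            "priority": 100,
            "allocator_type": "OFFSET",
        }
    ]
}

# Compact separators reproduce the whitespace-free JSON used on the command line.
config_json = json.dumps(tier_config, separators=(",", ":"))

client_argv = [
    "./mooncake_client",
    "--master_server_address", "127.0.0.1:50051",
    "--port", "50052",
    "--client_rpc_port", "12345",
    "--deployment_mode", "P2P",
    "--tiered_backend_config", config_json,
]

# shlex.join quotes the JSON so the result can be pasted into a shell.
command_line = shlex.join(client_argv)
```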

Subsequent Operations

For subsequent operations, refer to the Mooncake-Store documentation.