Version: v26.03

AI Inference Hermes Routing Plugin Development Guide

Feature Overview

Hermes-router is a Kubernetes (K8s) native intelligent routing solution for AI inference. It receives user inference requests and forwards them to appropriate inference service backends.

As an Endpoint Picker Plugin (EPP) under the Gateway API Inference Extension (GIE) framework, Hermes-router focuses only on the request forwarding process. This guide describes how to develop custom routing strategies based on the EPP framework.

Constraints and Limitations

Inference Engine Limitations

Hermes-router supports the same inference engines as GIE. However, some plugins depend on cache-indexer to calculate the global KVCache hit rate, and cache-indexer v25.12 supports only vLLM. At this stage, the following plugins can therefore be used only when the inference engine is vLLM:

  • scorer-aggregate-kv-cache-aware
  • scorer-pd-kv-cache-aware

Other plugins theoretically support multiple inference engines, but currently only scenarios with vLLM as the inference engine have been validated.

Development Limitations

In this document, one EPP instance represents one routing strategy, and a routing strategy typically consists of multiple types of plugins. Plugin development must comply with the GIE Framework Specification.

Deployment Limitations

  • Deployment and operation are supported only in Kubernetes environments.
  • Plugins must be developed locally and built into images before they can be deployed and run.
  • Changes to the routing strategy configuration take effect only after the EPP is restarted.

Environment Preparation

Environment Requirements

Hardware Requirements

The Hermes-router development environment has no special hardware requirements. The following configuration is recommended:

  • CPU: 4 cores or more
  • Memory: 8GB or more
  • Disk: 20GB or more available space

Software Requirements

  • Operating System: Linux
  • Go Environment: Go 1.21 or higher
  • Docker: Docker 20.10+ or compatible container runtime (such as nerdctl)
  • Kubernetes Cluster: For deployment and testing
  • kubectl: For interacting with the K8s cluster
  • Helm: Version 3.0+, for deploying Hermes-router

Dependency Components

  • Basic Dependencies

    • Open-source Gateway: Must support the Gateway API Inference Extension, such as Istio v1.28.
  • Additional Dependencies

    To develop with the KVCache-aware routing strategies provided by Hermes-router, the following additional dependency is required:

    • KVCache Global Management Component: Provides global KVCache awareness and hit rate calculation, such as cache-indexer v25.12.

Setting Up the Environment

  1. Clone the code repository.

    bash
    # Clone hermes-router main repository
    git clone https://gitcode.com/openFuyao/hermes-router.git
    cd hermes-router
  2. Configure Go development environment.

    bash
    # Check Go version
    go version
    
    # Set Go proxy (optional, accelerate dependency download)
    go env -w GOPROXY=https://goproxy.cn,direct
  3. Install Istio and GIE framework.

    bash
    # Download Istio
    curl -L https://istio.io/downloadIstio | sh -
    cd istio-*
    
    # Install Istio (enable GIE support)
    istioctl install -y \
    --set tag=<ISTIO_TAG> \
    --set hub=gcr.io/istio-testing \
    --set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true
    
    # Install Gateway API CRDs
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml
    
    # Install GIE CRDs (adjust according to actual version)
    kubectl apply -f <GIE_CRDs_YAML>
  4. Build hermes-router image.

    bash
    cd hermes-router
    
    # Build image
    docker build -t hermes-router:dev -f Dockerfile.epp .
    
    # Or use nerdctl
    nerdctl build -t hermes-router:dev -f Dockerfile.epp .

Verify Environment

  1. Verify Istio installation.

    bash
    # Check Istio components
    kubectl get pods -n istio-system
    
    # Expected output: All Pods status is Running
  2. Verify GIE CRDs.

    bash
    # Check InferencePool CRD
    kubectl get crd inferencepools.inference.networking.x-k8s.io
  3. Deploy hermes-router for verification.

    bash
    cd hermes-router
    
    # Update Helm dependencies
    helm dependency update examples/1_pd_bucket/charts/hermes-router
    
    # Deploy hermes-router (using the simplest configuration)
    helm upgrade --install vllm-qwen-qwen3-8b \
    examples/1_pd_bucket/charts/hermes-router \
    --namespace hermes-test --create-namespace \
    -f examples/profiles/epp-random-pd-bucket.yaml \
    --set inferencepool.provider.name=istio \
    --set image.repository=hermes-router \
    --set image.tag=dev
    
    # Check deployment status
    kubectl get pods -n hermes-test
    kubectl get svc -n hermes-test

Expected Results:

  • Pod status is Running.
  • Service is created normally.
  • No error logs.

bash
[root@master hermes-router]# kd get all
NAME                                          READY   STATUS    RESTARTS   
pod/vllm-pd-decode-5bb99b6764-cs52k           1/1     Running   0          
pod/vllm-pd-prefill-6db4c89f78-f7csp          1/1     Running   0          
pod/vllm-pd-proxy-5f8bfcf495-p5vpj            1/1     Running   0          
pod/vllm-qwen-qwen3-8b-epp-5fd87d6d59-r2xpv   1/1     Running   0          

NAME                             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
service/vllm-pd-proxy            ClusterIP   ******         <none>        8000/TCP,5557/TCP   16h
service/vllm-qwen-qwen3-8b-epp   ClusterIP   ******         <none>        9002/TCP,9090/TCP   83s

NAME                                     READY   UP-TO-DATE   AVAILABLE   
deployment.apps/vllm-pd-decode           1/1     1            1           
deployment.apps/vllm-pd-prefill          1/1     1            1           
deployment.apps/vllm-pd-proxy            1/1     1            1           
deployment.apps/vllm-qwen-qwen3-8b-epp   1/1     1            1           

NAME                                                DESIRED   CURRENT   READY   
replicaset.apps/vllm-pd-decode-5bb99b6764           1         1         1       
replicaset.apps/vllm-pd-prefill-6db4c89f78          1         1         1       
replicaset.apps/vllm-pd-proxy-5f8bfcf495            1         1         1       
replicaset.apps/vllm-qwen-qwen3-8b-epp-5fd87d6d59   1         1         1

Developing Custom Routing Strategies

Usage Scenario Overview

When preset routing strategies cannot meet specific business needs, developers can extend routing capabilities by implementing custom plugins. Typical scenarios include:

  • Business-specific routing rules: Routing based on custom attributes such as business labels, user IDs, etc.
  • Performance optimization needs: Optimize routing algorithms for specific hardware or network environments.
  • Hybrid architecture support: Support hybrid deployment of aggregated architecture and PD disaggregated architecture.

System Architecture

Hermes-router adopts a plugin architecture in which different types of plugins collaborate to implement intelligent routing. Plugins fall into four main categories: Filter plugins filter candidate endpoints, Scorer plugins score endpoints, Picker plugins select the final endpoint from the scoring results, and PreRequest plugins modify requests before forwarding. Each plugin type is described in detail below:

Table 1 Hermes-router Plugin Type Descriptions

| Plugin Type | Function Description | Example |
| --- | --- | --- |
| Filter Plugin | Filter candidate endpoint list | filter-by-pd-label: Filter based on PD labels. |
| Scorer Plugin | Score endpoints | scorer-aggregate-kv-cache-aware: Score based on KV Cache. |
| | | scorer-pd-bucket: Score based on Bucket load. |
| Picker Plugin | Select final endpoint from scoring results | picker-min-random: Select random instance with lowest score. |
| | | picker-pd-kv-cache-aware: Intelligent selection plugin for PD architecture. |
| PreRequest Plugin | Modify request before forwarding | pd-header-handler: Add request headers required for PD architecture. |

Plugin Execution Flow

Plugins execute in the order configured in schedulingProfiles:

  1. Filter Phase: All Filter plugins execute in sequence, progressively narrowing the set of candidate endpoints.
  2. Scorer Phase: All Scorer plugins score the candidate endpoints (each Scorer's weight is configurable).
  3. Picker Phase: The Picker plugin selects the final endpoint based on the scoring results.
  4. PreRequest Phase: The PreRequest plugin modifies the request before it is forwarded.
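
The four phases above can be sketched as a small Go pipeline. This is a minimal illustration under stated assumptions, not the GIE scheduler: Endpoint, Request, Profile, the function types, the "role" label, and the "x-target-endpoint" header are all hypothetical names. The picker here takes the lowest weighted total, following the picker-min-random description in Table 1; a real picker may rank scores differently.

```go
package main

import "fmt"

// Endpoint and Request are hypothetical stand-ins, not the GIE types.
type Endpoint struct {
	Name   string
	Labels map[string]string
	Load   float64
}

type Request struct{ Headers map[string]string }

type (
	FilterFn     func(req *Request, eps []Endpoint) []Endpoint
	ScoreFn      func(req *Request, ep Endpoint) float64
	PickFn       func(totals map[string]float64, eps []Endpoint) (Endpoint, bool)
	PreRequestFn func(req *Request, target Endpoint)
)

type WeightedScorer struct {
	Score  ScoreFn
	Weight float64
}

type Profile struct {
	Filters     []FilterFn
	Scorers     []WeightedScorer
	Picker      PickFn
	PreRequests []PreRequestFn
}

// Run executes the four phases in order and returns the chosen endpoint.
func (p Profile) Run(req *Request, eps []Endpoint) (Endpoint, bool) {
	for _, f := range p.Filters { // 1. Filter phase: narrow the candidates
		eps = f(req, eps)
	}
	totals := make(map[string]float64, len(eps))
	for _, s := range p.Scorers { // 2. Scorer phase: weighted sum per endpoint
		for _, ep := range eps {
			totals[ep.Name] += s.Weight * s.Score(req, ep)
		}
	}
	target, ok := p.Picker(totals, eps) // 3. Picker phase
	if !ok {
		return Endpoint{}, false
	}
	for _, pre := range p.PreRequests { // 4. PreRequest phase
		pre(req, target)
	}
	return target, true
}

// byLabel keeps only endpoints whose label key has the given value.
func byLabel(key, value string) FilterFn {
	return func(_ *Request, eps []Endpoint) []Endpoint {
		var kept []Endpoint
		for _, ep := range eps {
			if ep.Labels[key] == value {
				kept = append(kept, ep)
			}
		}
		return kept
	}
}

// pickLowest returns the endpoint with the lowest weighted total.
func pickLowest(totals map[string]float64, eps []Endpoint) (Endpoint, bool) {
	if len(eps) == 0 {
		return Endpoint{}, false
	}
	best := eps[0]
	for _, ep := range eps[1:] {
		if totals[ep.Name] < totals[best.Name] {
			best = ep
		}
	}
	return best, true
}

func exampleProfile() Profile {
	byLoad := func(_ *Request, ep Endpoint) float64 { return ep.Load }
	setHeader := func(req *Request, t Endpoint) { req.Headers["x-target-endpoint"] = t.Name }
	return Profile{
		Filters:     []FilterFn{byLabel("role", "decode")},
		Scorers:     []WeightedScorer{{Score: byLoad, Weight: 2}},
		Picker:      pickLowest,
		PreRequests: []PreRequestFn{setHeader},
	}
}

func main() {
	eps := []Endpoint{
		{Name: "pod-a", Labels: map[string]string{"role": "decode"}, Load: 0.9},
		{Name: "pod-b", Labels: map[string]string{"role": "decode"}, Load: 0.2},
		{Name: "pod-c", Labels: map[string]string{"role": "prefill"}, Load: 0.1},
	}
	req := &Request{Headers: map[string]string{}}
	target, ok := exampleProfile().Run(req, eps)
	fmt.Println(target.Name, ok, req.Headers["x-target-endpoint"]) // prints: pod-b true pod-b
}
```

In this run, pod-c is removed in the Filter phase despite having the lowest load, which shows why the order of the phases matters.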

The following links provide relevant documentation required for developing custom routing strategies:

  • hermes-router code repository: Hermes-router's source code repository, containing all plugin implementation examples and development framework. Developers can refer to existing plugin code for development.
  • GIE Framework Specification: Official specification document for Gateway API Inference Extension (GIE), defining the interface specification and development standards for EPP plugins. This is the specification that must be followed when developing EPP plugins.
  • EPP Architecture Proposal: Endpoint Picker Plugin (EPP) architecture proposal, detailing the architectural design, execution flow, and extension mechanism of EPP plugins, helping developers deeply understand plugin working principles.

Development Steps

  1. Design routing strategy.

    Designing the routing strategy and its specific plugins is the developer's responsibility and is outside the scope of this guide.

  2. Develop routing plugins.

    After completing the design of the routing strategy, developers need to split the processing flow into several EPP-compliant plugins and implement them. The main plugin types are Filter, Scorer, Picker, PreRequest, etc.

    The following uses a Filter plugin for filtering Pods as an example to demonstrate the EPP plugin development process:

    2.1. Create Filter code file.

    Hermes-router stores EPP plugins by category. It is recommended to create a new Filter plugin under pkg/plugins/filter.

    2.2. Define Filter plugin struct.

    go
    type MyFilter struct {
       typedName plugins.TypedName
       // ... other member variables
    }

    2.3. Implement struct constructor.

    go
    func NewMyFilter(...) *MyFilter {
       // ... initialize member variables
       return &MyFilter{...}
    }

    2.4. Implement factory function.

    go
    func ByMyFilterFactory(...) {
       // ... initialize Filter member variables based on parameters passed by GIE
       return NewMyFilter(...)
    }

    2.5. Implement the Filter() method in the Filter interface.

    go
    func (m *MyFilter) Filter(_ context.Context, _ *types.CycleState, _ *types.LLMRequest, pods []types.Pod) []types.Pod {
       return pods // ... replace with the filtering logic; return the pods that remain candidates
    }

    2.6. Register the new Filter in pkg/plugins/register.go.

    go
    func RegisterAllPlugins() {
       plugins.Register(filter.MyFilter, filter.ByMyFilterFactory)
       // ... other plugins
    }

    At this point, development of the Filter plugin is complete. Developers can develop other types of EPP plugins following the same process.

  3. Define routing strategy.

    To apply the routing strategy, configure it in the YAML file for the InferencePool resource:

    • Declare the routing strategy name in the inferenceExtension.pluginsConfigFile field.
    • Declare the specific routing strategy configuration in the inferenceExtension.pluginsCustomConfig field.
    yaml
    # Configuration example for example-strategy routing strategy
    inferencepool:
      inferenceExtension:
        pluginsConfigFile: "example_strategy.yaml"
        pluginsCustomConfig:
          example_strategy.yaml: |
            apiVersion: inference.networking.x-k8s.io/v1alpha1
            kind: EndpointPickerConfig
            plugins:
            # ... plugins

At this point, developers have completed the development of the routing strategy.

Debugging and Verification

Use kubectl port-forward for port forwarding.

  1. Execute the following command to find the Gateway Service name.

    shell
    kubectl get svc -n <NAMESPACE> -l gateway.networking.k8s.io/gateway-name=inference-gateway
  2. Execute the following command to set up port forwarding.

    shell
    kubectl port-forward -n <NAMESPACE> service/<gateway-service-name> 8000:80
  3. Execute the following command to send a request.

    shell
    curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
       "model": "Qwen/Qwen3-8B",
       "messages": [
          {"role": "user", "content": "Hello"}
       ],
       "max_tokens": 100,
       "temperature": 0.7,
       "stream": false
    }'

After development is complete, please refer to the Installation section in the User Guide for deployment and verification work.