AI Inference Hermes Routing Plugin Development Guide

Feature Overview

Hermes-router is a Kubernetes-native (K8s) intelligent routing solution for AI inference. It receives user inference requests and forwards them to an appropriate inference service backend.

As an EPP plugin under the GIE framework, Hermes-router focuses only on the request forwarding process. This guide introduces how developers can develop custom routing strategies based on the EPP framework.

Constraints and Limitations

Inference Engine Limitations

Hermes-router supports the same inference engines as GIE. However, some plugins depend on cache-indexer to calculate the global KVCache hit rate, and cache-indexer v25.12 supports only vLLM. At this stage, the following plugins can therefore be used only when the inference engine is vLLM:

scorer-aggregate-kv-cache-aware
scorer-pd-kv-cache-aware

In theory, the other plugins support multiple inference engines, but so far only scenarios with vLLM as the inference engine have been validated.

Development Limitations

In this document, one EPP instance represents one routing strategy, and a routing strategy typically consists of multiple types of plugins. Plugin development must comply with the GIE Framework Specification.

Deployment Limitations

Hermes-router can be deployed and run only in Kubernetes environments.
Plugins must be developed locally and built into images before they can be deployed and run.
Changes to the routing strategy configuration take effect only after the EPP is restarted.

Environment Preparation

Environment Requirements

Hardware Requirements

The Hermes-router development environment has no special hardware requirements. The following configuration is recommended.

CPU: 4 cores or more
Memory: 8GB or more
Disk: 20GB or more available space

Software Requirements

Operating System: Linux
Go Environment: Go 1.21 or higher
Docker: Docker 20.10+ or compatible container runtime (such as nerdctl)
Kubernetes Cluster: For deployment and testing
kubectl: For interacting with K8s cluster
Helm: Version 3.0+, for deploying Hermes-router

Dependency Components

Basic Dependencies
- Open-source Gateway: Needs to support Gateway API Inference Extension, such as Istio v1.28.
Additional Dependencies
To develop with the KVCache-aware routing strategies provided by Hermes-router, the following additional dependency is required:
- KVCache Global Management Component: Provides global KVCache awareness and hit rate calculation, such as cache-indexer v25.12.

Setting Up the Environment

Clone the code repository.

bash

# Clone hermes-router main repository
git clone https://gitcode.com/openFuyao/hermes-router.git
cd hermes-router

Configure Go development environment.

bash

# Check Go version
go version

# Set Go proxy (optional, accelerate dependency download)
go env -w GOPROXY=https://goproxy.cn,direct

Install Istio and GIE framework.

bash

# Download Istio
curl -L https://istio.io/downloadIstio | sh -
cd istio-*

# Make the bundled istioctl available in the current shell
export PATH="$PWD/bin:$PATH"

# Install Istio (enable GIE support)
istioctl install -y \
--set tag=<ISTIO_TAG> \
--set hub=gcr.io/istio-testing \
--set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true

# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml

# Install GIE CRDs (adjust according to actual version)
kubectl apply -f <GIE_CRDs_YAML>

Build hermes-router image.

bash

cd hermes-router

# Build image
docker build -t hermes-router:dev -f Dockerfile.epp .

# Or use nerdctl
nerdctl build -t hermes-router:dev -f Dockerfile.epp .

Verify Environment

Verify Istio installation.

bash

# Check Istio components
kubectl get pods -n istio-system

# Expected output: All Pods status is Running

Verify GIE CRDs.

bash

# Check InferencePool CRD
kubectl get crd inferencepools.inference.networking.x-k8s.io

Deploy hermes-router for verification.

bash

cd hermes-router

# Update Helm dependencies
helm dependency update examples/1_pd_bucket/charts/hermes-router

# Deploy hermes-router (using the simplest configuration)
helm upgrade --install vllm-qwen-qwen3-8b \
examples/1_pd_bucket/charts/hermes-router \
--namespace hermes-test --create-namespace \
-f examples/profiles/epp-random-pd-bucket.yaml \
--set inferencepool.provider.name=istio \
--set image.repository=hermes-router \
--set image.tag=dev

# Check deployment status
kubectl get pods -n hermes-test
kubectl get svc -n hermes-test

Expected Results:

Pod status is Running.
Service is created normally.
No error logs.

bash

[root@master hermes-router]# kd get all
NAME                                          READY   STATUS    RESTARTS   
pod/vllm-pd-decode-5bb99b6764-cs52k           1/1     Running   0          
pod/vllm-pd-prefill-6db4c89f78-f7csp          1/1     Running   0          
pod/vllm-pd-proxy-5f8bfcf495-p5vpj            1/1     Running   0          
pod/vllm-qwen-qwen3-8b-epp-5fd87d6d59-r2xpv   1/1     Running   0          

NAME                             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
service/vllm-pd-proxy            ClusterIP   ******         <none>        8000/TCP,5557/TCP   16h
service/vllm-qwen-qwen3-8b-epp   ClusterIP   ******         <none>        9002/TCP,9090/TCP   83s

NAME                                     READY   UP-TO-DATE   AVAILABLE   
deployment.apps/vllm-pd-decode           1/1     1            1           
deployment.apps/vllm-pd-prefill          1/1     1            1           
deployment.apps/vllm-pd-proxy            1/1     1            1           
deployment.apps/vllm-qwen-qwen3-8b-epp   1/1     1            1           

NAME                                                DESIRED   CURRENT   READY   
replicaset.apps/vllm-pd-decode-5bb99b6764           1         1         1       
replicaset.apps/vllm-pd-prefill-6db4c89f78          1         1         1       
replicaset.apps/vllm-pd-proxy-5f8bfcf495            1         1         1       
replicaset.apps/vllm-qwen-qwen3-8b-epp-5fd87d6d59   1         1         1

Developing Custom Routing Strategies

Usage Scenario Overview

When preset routing strategies cannot meet specific business needs, developers can extend routing capabilities by implementing custom plugins. Typical scenarios include:

Business-specific routing rules: Routing based on custom attributes such as business labels, user IDs, etc.
Performance optimization needs: Optimize routing algorithms for specific hardware or network environments.
Hybrid architecture support: Support hybrid deployment of aggregated architecture and PD disaggregated architecture.

System Architecture

Hermes-router adopts a plugin architecture, implementing intelligent routing through the collaboration of different types of plugins. Plugins are mainly divided into four categories: Filter plugins filter candidate endpoints, Scorer plugins score endpoints, Picker plugins select the final endpoint from scoring results, and PreRequest plugins modify requests before forwarding. Detailed descriptions of each plugin type are as follows:

Table 1 Hermes-router Plugin Type Descriptions

Plugin Type	Function Description	Example
Filter Plugin	Filter candidate endpoint list	`filter-by-pd-label`: Filter based on PD labels.
Scorer Plugin	Score endpoints	`scorer-aggregate-kv-cache-aware`: Score based on KV Cache. `scorer-pd-bucket`: Score based on Bucket load.
Picker Plugin	Select final endpoint from scoring results	`picker-min-random`: Select a random instance among those with the lowest score. `picker-pd-kv-cache-aware`: Intelligent selection plugin for PD architecture.
PreRequest Plugin	Modify request before forwarding	`pd-header-handler`: Add request headers required for PD architecture.
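
To make the Picker row of Table 1 more concrete, the following is a minimal sketch of the "random instance with lowest score" idea. It is not the Hermes-router implementation of `picker-min-random`: the `ScoredEndpoint` type, the `pickMinRandom` function, and the injected `intn` chooser are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// ScoredEndpoint is a hypothetical stand-in for an endpoint and its score.
type ScoredEndpoint struct {
	Name  string
	Score float64
}

// pickMinRandom returns one endpoint chosen among those that share the
// lowest score. intn must return a value in [0, n), like rand.Intn.
// The boolean result is false when there are no candidates.
func pickMinRandom(candidates []ScoredEndpoint, intn func(n int) int) (ScoredEndpoint, bool) {
	if len(candidates) == 0 {
		return ScoredEndpoint{}, false
	}
	// First pass: find the lowest score.
	lowestScore := candidates[0].Score
	for _, c := range candidates[1:] {
		if c.Score < lowestScore {
			lowestScore = c.Score
		}
	}
	// Second pass: collect every endpoint tied at that score.
	var lowest []ScoredEndpoint
	for _, c := range candidates {
		if c.Score == lowestScore {
			lowest = append(lowest, c)
		}
	}
	return lowest[intn(len(lowest))], true
}

func main() {
	picked, ok := pickMinRandom([]ScoredEndpoint{
		{Name: "pod-a", Score: 0.2},
		{Name: "pod-b", Score: 0.1},
		{Name: "pod-c", Score: 0.1},
	}, rand.Intn)
	fmt.Println(picked.Name, ok) // pod-b or pod-c, chosen at random
}
```

Passing the random source as a function keeps the selection logic deterministic under test, which is one possible design choice rather than a requirement of the framework.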

Plugin Execution Flow

Plugins execute in the order configured in schedulingProfiles:

Filter Phase: All Filter plugins execute in sequence, gradually narrowing the candidate endpoint range.
Scorer Phase: All Scorer plugins score candidate endpoints (weight can be configured).
Picker Phase: Picker plugin selects the final endpoint based on scoring results.
PreRequest Phase: PreRequest plugin modifies the request before forwarding.
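
The four phases above can be sketched as a small, self-contained pipeline. This is an illustration under stated assumptions, not the GIE or Hermes-router scheduler: `Endpoint`, `Request`, `Profile`, `readyOnly`, `pickLowest`, and the `x-example-target` header are all hypothetical, and the sketch assumes that scorer results are combined as a weighted sum and that a lower total is better.

```go
package main

import "fmt"

// Endpoint and Request are hypothetical stand-ins, not framework types.
type Endpoint struct {
	Name  string
	Ready bool
	Load  float64 // assumed load metric in [0, 1]
}

type Request struct {
	Headers map[string]string
}

type (
	Filter     func(eps []Endpoint) []Endpoint
	Picker     func(eps []Endpoint, scores map[string]float64) (Endpoint, bool)
	PreRequest func(req *Request, target Endpoint)
)

// Scorer pairs a scoring function with its configured weight.
type Scorer struct {
	Weight float64
	Score  func(ep Endpoint) float64
}

// Profile holds the plugins of one routing strategy in execution order.
type Profile struct {
	Filters     []Filter
	Scorers     []Scorer
	Picker      Picker
	PreRequests []PreRequest
}

// Run executes the Filter, Scorer, Picker, and PreRequest phases in order.
func (p Profile) Run(req *Request, eps []Endpoint) (Endpoint, bool) {
	for _, f := range p.Filters { // Filter phase: narrow the candidates
		eps = f(eps)
	}
	scores := make(map[string]float64, len(eps))
	for _, s := range p.Scorers { // Scorer phase: weighted sum per endpoint
		for _, ep := range eps {
			scores[ep.Name] += s.Weight * s.Score(ep)
		}
	}
	target, ok := p.Picker(eps, scores) // Picker phase
	if !ok {
		return Endpoint{}, false
	}
	for _, pr := range p.PreRequests { // PreRequest phase
		pr(req, target)
	}
	return target, true
}

// readyOnly drops endpoints that are not ready.
func readyOnly(eps []Endpoint) []Endpoint {
	var kept []Endpoint
	for _, ep := range eps {
		if ep.Ready {
			kept = append(kept, ep)
		}
	}
	return kept
}

// pickLowest returns the endpoint with the lowest total score; the first
// endpoint wins ties.
func pickLowest(eps []Endpoint, scores map[string]float64) (Endpoint, bool) {
	if len(eps) == 0 {
		return Endpoint{}, false
	}
	best := eps[0]
	for _, ep := range eps[1:] {
		if scores[ep.Name] < scores[best.Name] {
			best = ep
		}
	}
	return best, true
}

func exampleProfile() Profile {
	return Profile{
		Filters: []Filter{readyOnly},
		Scorers: []Scorer{{Weight: 1.0, Score: func(ep Endpoint) float64 { return ep.Load }}},
		Picker:  pickLowest,
		PreRequests: []PreRequest{func(req *Request, target Endpoint) {
			req.Headers["x-example-target"] = target.Name // hypothetical header
		}},
	}
}

func main() {
	req := &Request{Headers: map[string]string{}}
	eps := []Endpoint{
		{Name: "a", Ready: true, Load: 0.7},
		{Name: "b", Ready: false, Load: 0.1},
		{Name: "c", Ready: true, Load: 0.3},
	}
	target, ok := exampleProfile().Run(req, eps)
	fmt.Println(target.Name, ok, req.Headers["x-example-target"])
}
```

In this example, endpoint `b` has the lowest load but is removed in the Filter phase, so the Picker chooses `c` and the PreRequest step records that choice on the request.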

The following links provide relevant documentation required for developing custom routing strategies:

hermes-router code repository: Hermes-router's source code repository, containing all plugin implementation examples and development framework. Developers can refer to existing plugin code for development.
GIE Framework Specification: Official specification document for Gateway API Inference Extension (GIE), defining the interface specification and development standards for EPP plugins. This is the specification that must be followed when developing EPP plugins.
EPP Architecture Proposal: Endpoint Picker Plugin (EPP) architecture proposal, detailing the architectural design, execution flow, and extension mechanism of EPP plugins, helping developers deeply understand plugin working principles.

Development Steps

1. Design the routing strategy.
Developers design the routing strategy and its plugins themselves; this guide does not cover that step.
2. Develop routing plugins.
After designing the routing strategy, split the processing flow into several EPP-compliant plugins and implement them. The main plugin types are Filter, Scorer, Picker, and PreRequest.
The following uses a Filter plugin that filters Pods as an example to demonstrate the EPP plugin development process:
2.1. Create Filter code file.
Hermes-router stores EPP plugins by category. It is recommended to create a new Filter plugin under pkg/plugins/filter.
2.2. Define Filter plugin struct.
```go
type MyFilter struct {
	typedName plugins.TypedName
	// ... other member variables
}
```
2.3. Implement struct constructor.
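```go
// NewMyFilter builds the plugin; add whatever parameters the plugin requires.
func NewMyFilter( /* ... */ ) *MyFilter {
	// ... initialize member variables
	return &MyFilter{ /* ... */ }
}
```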
2.4. Implement factory function.
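```go
// ByMyFilterFactory is invoked by GIE with the plugin's configured parameters.
// The return types shown here are an assumption; match the factory signature
// defined by the GIE version in use.
func ByMyFilterFactory( /* ... */ ) (plugins.Plugin, error) {
	// ... initialize Filter member variables based on parameters passed by GIE
	return NewMyFilter( /* ... */ ), nil
}
```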
2.5. Implement the Filter() method in the Filter interface.
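```go
// Filter returns the subset of pods that remain candidates for this request.
func (m *MyFilter) Filter(_ context.Context, _ *types.CycleState, _ *types.LLMRequest, pods []types.Pod) []types.Pod {
	// ... apply the filtering logic; returning pods unchanged keeps every candidate
	return pods
}
```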
2.6. Register the new Filter in pkg/plugins/register.go.
```go
func RegisterAllPlugins() {
	plugins.Register(filter.MyFilter, filter.ByMyFilterFactory)
	// ... other plugins
}
```
At this point, a Filter plugin development is complete. Developers can develop other types of EPP plugins following the same process.
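
Steps 2.2 to 2.6 can be sketched end to end with stand-in types. The sketch below does not use the GIE packages: `Pod`, `Plugin`, `Factory`, `Register`, `LabelFilter`, and the `key`/`value` parameters are hypothetical, and the real interfaces take additional arguments (for example the context and the request). It only illustrates how the struct, constructor, factory, `Filter()` method, and registration fit together.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Pod and Plugin are hypothetical stand-ins for the framework types.
type Pod struct {
	Name   string
	Labels map[string]string
}

type Plugin interface {
	Filter(pods []Pod) []Pod
}

// Factory builds a plugin instance from its name and raw JSON parameters.
type Factory func(name string, rawParams json.RawMessage) (Plugin, error)

// registry maps a plugin type to its factory (step 2.6).
var registry = map[string]Factory{}

// Register adds a factory and rejects duplicate plugin types.
func Register(pluginType string, f Factory) error {
	if _, exists := registry[pluginType]; exists {
		return fmt.Errorf("plugin type %q already registered", pluginType)
	}
	registry[pluginType] = f
	return nil
}

// LabelFilterType is the name under which the plugin is registered.
const LabelFilterType = "example-label-filter"

// LabelFilter keeps pods whose label key equals value (step 2.2).
type LabelFilter struct {
	name, key, value string
}

// NewLabelFilter is the constructor (step 2.3).
func NewLabelFilter(name, key, value string) *LabelFilter {
	return &LabelFilter{name: name, key: key, value: value}
}

// LabelFilterFactory decodes the parameters and calls the constructor (step 2.4).
func LabelFilterFactory(name string, rawParams json.RawMessage) (Plugin, error) {
	params := struct {
		Key   string `json:"key"`
		Value string `json:"value"`
	}{}
	if err := json.Unmarshal(rawParams, &params); err != nil {
		return nil, fmt.Errorf("plugin %q: invalid parameters: %w", name, err)
	}
	if params.Key == "" {
		return nil, fmt.Errorf("plugin %q: parameter \"key\" is required", name)
	}
	return NewLabelFilter(name, params.Key, params.Value), nil
}

// Filter returns a new slice with the matching pods and leaves the input
// slice untouched (step 2.5).
func (f *LabelFilter) Filter(pods []Pod) []Pod {
	kept := make([]Pod, 0, len(pods))
	for _, p := range pods {
		if p.Labels[f.key] == f.value {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	if err := Register(LabelFilterType, LabelFilterFactory); err != nil {
		panic(err)
	}
	plugin, err := registry[LabelFilterType]("decode-only",
		json.RawMessage(`{"key":"role","value":"decode"}`))
	if err != nil {
		panic(err)
	}
	pods := []Pod{
		{Name: "p1", Labels: map[string]string{"role": "prefill"}},
		{Name: "d1", Labels: map[string]string{"role": "decode"}},
	}
	for _, p := range plugin.Filter(pods) {
		fmt.Println(p.Name) // prints "d1"
	}
}
```

Validating parameters in the factory means a misconfigured plugin fails when the strategy is loaded rather than on the first request.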

3. Define the routing strategy.

To apply the routing strategy, configure it in the YAML file of the InferencePool resource:

Declare the routing strategy name in the inferenceExtension.pluginsConfigFile field.
Declare the specific routing strategy configuration in the inferenceExtension.pluginsCustomConfig field.

yaml

# Configuration example for the example_strategy routing strategy
inferencepool:
  inferenceExtension:
    pluginsConfigFile: "example_strategy.yaml"
    pluginsCustomConfig:
      example_strategy.yaml: |
        apiVersion: inference.networking.x-k8s.io/v1alpha1
        kind: EndpointPickerConfig
        plugins:
        # ... plugins

At this point, developers have completed the development of the routing strategy.

Debugging and Verification

Use kubectl port-forward for port forwarding.

Execute the following command to find the Gateway Service name.

shell

kubectl get svc -n <NAMESPACE> -l gateway.networking.k8s.io/gateway-name=inference-gateway

Execute the following command to set up port forwarding.

shell

kubectl port-forward -n <NAMESPACE> service/<gateway-service-name> 8000:80

Execute the following command to send a request.

shell

curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
   "model": "Qwen/Qwen3-8B",
   "messages": [
      {"role": "user", "content": "Hello"}
   ],
   "max_tokens": 100,
   "temperature": 0.7,
   "stream": false
}'
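
For repeated checks, the same request can be sent from Go. The following is a minimal sketch that assumes the port forwarding above is active on `localhost:8000`; the `chatRequest` and `chatMessage` types and the `newChatRequest` helper are hypothetical names that simply mirror the JSON body of the curl command.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatMessage and chatRequest mirror the JSON body of the curl example.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model       string        `json:"model"`
	Messages    []chatMessage `json:"messages"`
	MaxTokens   int           `json:"max_tokens"`
	Temperature float64       `json:"temperature"`
	Stream      bool          `json:"stream"`
}

// newChatRequest builds the POST request without sending it.
func newChatRequest(baseURL, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:       model,
		Messages:    []chatMessage{{Role: "user", Content: prompt}},
		MaxTokens:   100,
		Temperature: 0.7,
		Stream:      false,
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newChatRequest("http://localhost:8000", "Qwen/Qwen3-8B", "Hello")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```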

After development is complete, please refer to the Installation section in the User Guide for deployment and verification work.
