AI Inference Hermes Routing Plugin Development Guide
Feature Overview
Hermes-router is a Kubernetes-native (K8s) intelligent routing solution for AI inference. It receives user inference requests and forwards them to an appropriate inference service backend.
As an Endpoint Picker Plugin (EPP) under the Gateway API Inference Extension (GIE) framework, Hermes-router focuses only on the request forwarding process. This guide describes how to develop custom routing strategies based on the EPP framework.
Constraints and Limitations
Inference Engine Limitations
Hermes-router supports the same inference engines as GIE. However, some plugins depend on cache-indexer to calculate the global KVCache hit rate, and cache-indexer v25.12 supports only vLLM. As a result, the following plugins can currently be used only when the inference engine is vLLM:
- scorer-aggregate-kv-cache-aware
- scorer-pd-kv-cache-aware
Other plugins theoretically support multiple inference engines, but currently only scenarios with vLLM as the inference engine have been validated.
Development Limitations
In this document, one EPP instance represents one routing strategy, and a routing strategy typically consists of multiple types of plugins. Plugin development must comply with the GIE Framework Specification.
Deployment Limitations
- Only supports deployment and running in Kubernetes environments.
- Plugins need to be developed locally and built into images before they can be deployed and run.
- Changes to the routing strategy configuration take effect only after the EPP is restarted.
Environment Preparation
Environment Requirements
Hardware Requirements
The Hermes-router development environment has no special hardware requirements. The following configuration is recommended.
- CPU: 4 cores or more
- Memory: 8GB or more
- Disk: 20GB or more available space
Software Requirements
- Operating System: Linux
- Go Environment: Go 1.21 or higher
- Docker: Docker 20.10+ or compatible container runtime (such as nerdctl)
- Kubernetes Cluster: For deployment and testing
- kubectl: For interacting with K8s cluster
- Helm: Version 3.0+, for deploying Hermes-router
Dependency Components
Basic Dependencies
- Open-source Gateway: Needs to support Gateway API Inference Extension, such as Istio v1.28.
Additional Dependencies
Developing with the KVCache-aware routing strategies provided by Hermes-router requires the following additional dependency:
- KVCache Global Management Component: Provides global KVCache awareness and hit rate calculation, such as cache-indexer v25.12.
Setting Up the Environment
Clone the code repository.
```bash
# Clone hermes-router main repository
git clone https://gitcode.com/openFuyao/hermes-router.git
cd hermes-router
```
Configure Go development environment.
```bash
# Check Go version
go version
# Set Go proxy (optional, accelerate dependency download)
go env -w GOPROXY=https://goproxy.cn,direct
```
Install Istio and GIE framework.
```bash
# Download Istio
curl -L https://istio.io/downloadIstio | sh -
cd istio-*
# Install Istio (enable GIE support)
istioctl install -y \
  --set tag=<ISTIO_TAG> \
  --set hub=gcr.io/istio-testing \
  --set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true
# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml
# Install GIE CRDs (adjust according to actual version)
kubectl apply -f <GIE_CRDs_YAML>
```
Build hermes-router image.
```bash
cd hermes-router
# Build image
docker build -t hermes-router:dev -f Dockerfile.epp .
# Or use nerdctl
nerdctl build -t hermes-router:dev -f Dockerfile.epp .
```
Verify Environment
Verify Istio installation.
```bash
# Check Istio components
kubectl get pods -n istio-system
# Expected output: All Pods status is Running
```
Verify GIE CRDs.
```bash
# Check InferencePool CRD
kubectl get crd inferencepools.inference.networking.x-k8s.io
```
Deploy hermes-router for verification.
```bash
cd hermes-router
# Update Helm dependencies
helm dependency update examples/1_pd_bucket/charts/hermes-router
# Deploy hermes-router (using the simplest configuration)
helm upgrade --install vllm-qwen-qwen3-8b \
  examples/1_pd_bucket/charts/hermes-router \
  --namespace hermes-test --create-namespace \
  -f examples/profiles/epp-random-pd-bucket.yaml \
  --set inferencepool.provider.name=istio \
  --set image.repository=hermes-router \
  --set image.tag=dev
# Check deployment status
kubectl get pods -n hermes-test
kubectl get svc -n hermes-test
```
Expected Results:
- Pod status is Running.
- Service is created normally.
- No error logs.
```
[root@master hermes-router]# kd get all
NAME                                          READY   STATUS    RESTARTS
pod/vllm-pd-decode-5bb99b6764-cs52k           1/1     Running   0
pod/vllm-pd-prefill-6db4c89f78-f7csp          1/1     Running   0
pod/vllm-pd-proxy-5f8bfcf495-p5vpj            1/1     Running   0
pod/vllm-qwen-qwen3-8b-epp-5fd87d6d59-r2xpv   1/1     Running   0

NAME                             TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
service/vllm-pd-proxy            ClusterIP   ******       <none>        8000/TCP,5557/TCP   16h
service/vllm-qwen-qwen3-8b-epp   ClusterIP   ******       <none>        9002/TCP,9090/TCP   83s

NAME                                     READY   UP-TO-DATE   AVAILABLE
deployment.apps/vllm-pd-decode           1/1     1            1
deployment.apps/vllm-pd-prefill          1/1     1            1
deployment.apps/vllm-pd-proxy            1/1     1            1
deployment.apps/vllm-qwen-qwen3-8b-epp   1/1     1            1

NAME                                                DESIRED   CURRENT   READY
replicaset.apps/vllm-pd-decode-5bb99b6764           1         1         1
replicaset.apps/vllm-pd-prefill-6db4c89f78          1         1         1
replicaset.apps/vllm-pd-proxy-5f8bfcf495            1         1         1
replicaset.apps/vllm-qwen-qwen3-8b-epp-5fd87d6d59   1         1         1
```
Developing Custom Routing Strategies
Usage Scenario Overview
When preset routing strategies cannot meet specific business needs, developers can extend routing capabilities by implementing custom plugins. Typical scenarios include:
- Business-specific routing rules: Routing based on custom attributes such as business labels, user IDs, etc.
- Performance optimization needs: Optimize routing algorithms for specific hardware or network environments.
- Hybrid architecture support: Support hybrid deployment of aggregated architecture and PD disaggregated architecture.
System Architecture
Hermes-router adopts a plugin architecture, implementing intelligent routing through the collaboration of different types of plugins. Plugins are mainly divided into four categories: Filter plugins filter candidate endpoints, Scorer plugins score endpoints, Picker plugins select the final endpoint from scoring results, and PreRequest plugins modify requests before forwarding. Detailed descriptions of each plugin type are as follows:
Table 1 Hermes-router Plugin Type Descriptions
| Plugin Type | Function Description | Example |
|---|---|---|
| Filter Plugin | Filter candidate endpoint list | filter-by-pd-label: Filter based on PD labels. |
| Scorer Plugin | Score endpoints | scorer-aggregate-kv-cache-aware: Score based on KV Cache.<br>scorer-pd-bucket: Score based on Bucket load. |
| Picker Plugin | Select final endpoint from scoring results | picker-min-random: Select random instance with lowest score.<br>picker-pd-kv-cache-aware: Intelligent selection plugin for PD architecture. |
| PreRequest Plugin | Modify request before forwarding | pd-header-handler: Add request headers required for PD architecture. |
Plugin Execution Flow
Plugins execute in the order configured in schedulingProfiles:
- Filter Phase: All Filter plugins execute in sequence, gradually narrowing the candidate endpoint range.
- Scorer Phase: All Scorer plugins score candidate endpoints (weight can be configured).
- Picker Phase: Picker plugin selects the final endpoint based on scoring results.
- PreRequest Phase: PreRequest plugin modifies the request before forwarding.
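The four phases above can be sketched as a small, self-contained Go program. This is only an illustration of the flow, not the GIE or Hermes-router API: the types `Endpoint`, `Request`, `Filter`, `WeightedScorer`, `Picker`, and `PreRequest`, the `x-target-endpoint` header, and the `Load` metric are all hypothetical stand-ins. The picker follows the "random instance with lowest score" behavior described for picker-min-random in Table 1.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Endpoint and Request are hypothetical stand-ins, not GIE types.
type Endpoint struct {
	Name   string
	Labels map[string]string
	Load   float64 // hypothetical load metric; lower is better here
}

type Request struct{ Headers map[string]string }

type Filter func([]Endpoint) []Endpoint
type Scorer func(Endpoint) float64
type WeightedScorer struct {
	Score  Scorer
	Weight float64
}
type Picker func([]Endpoint, map[string]float64) (Endpoint, bool)
type PreRequest func(*Request, Endpoint)

// RunProfile executes Filter -> Scorer -> Picker -> PreRequest in order.
func RunProfile(req *Request, eps []Endpoint, filters []Filter,
	scorers []WeightedScorer, pick Picker, pre []PreRequest) (Endpoint, bool) {
	for _, f := range filters { // Filter phase: narrow the candidates step by step
		eps = f(eps)
	}
	scores := make(map[string]float64, len(eps))
	for _, ep := range eps { // Scorer phase: weighted sum of all scorers
		for _, s := range scorers {
			scores[ep.Name] += s.Weight * s.Score(ep)
		}
	}
	target, ok := pick(eps, scores) // Picker phase
	if !ok {
		return Endpoint{}, false
	}
	for _, p := range pre { // PreRequest phase: adjust the request before forwarding
		p(req, target)
	}
	return target, true
}

// ByLabel keeps only endpoints whose label key equals value.
func ByLabel(key, value string) Filter {
	return func(eps []Endpoint) []Endpoint {
		var kept []Endpoint
		for _, ep := range eps {
			if ep.Labels[key] == value {
				kept = append(kept, ep)
			}
		}
		return kept
	}
}

// LoadScore uses the hypothetical load metric directly as the score.
func LoadScore(ep Endpoint) float64 { return ep.Load }

// PickMinRandom picks randomly among the endpoints tied for the lowest score.
func PickMinRandom(eps []Endpoint, scores map[string]float64) (Endpoint, bool) {
	if len(eps) == 0 {
		return Endpoint{}, false
	}
	lowest := scores[eps[0].Name]
	for _, ep := range eps[1:] {
		if s := scores[ep.Name]; s < lowest {
			lowest = s
		}
	}
	var best []Endpoint
	for _, ep := range eps {
		if scores[ep.Name] == lowest {
			best = append(best, ep)
		}
	}
	return best[rand.Intn(len(best))], true
}

// AddTargetHeader records the chosen endpoint in a hypothetical header.
func AddTargetHeader(req *Request, target Endpoint) {
	req.Headers["x-target-endpoint"] = target.Name
}

func main() {
	eps := []Endpoint{
		{Name: "prefill-a", Labels: map[string]string{"role": "prefill"}, Load: 0.1},
		{Name: "decode-a", Labels: map[string]string{"role": "decode"}, Load: 0.8},
		{Name: "decode-b", Labels: map[string]string{"role": "decode"}, Load: 0.3},
	}
	req := &Request{Headers: map[string]string{}}
	target, ok := RunProfile(req, eps, []Filter{ByLabel("role", "decode")},
		[]WeightedScorer{{Score: LoadScore, Weight: 1}}, PickMinRandom,
		[]PreRequest{AddTargetHeader})
	fmt.Println(target.Name, ok, req.Headers)
}
```

Keeping each phase behind its own function type mirrors the idea that a routing strategy is a composition of independently developed plugins; the real plugin interfaces are defined by the GIE Framework Specification.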
The following links provide relevant documentation required for developing custom routing strategies:
- hermes-router code repository: Hermes-router's source code repository, containing all plugin implementation examples and development framework. Developers can refer to existing plugin code for development.
- GIE Framework Specification: Official specification document for Gateway API Inference Extension (GIE), defining the interface specification and development standards for EPP plugins. This is the specification that must be followed when developing EPP plugins.
- EPP Architecture Proposal: Endpoint Picker Plugin (EPP) architecture proposal, detailing the architectural design, execution flow, and extension mechanism of EPP plugins, helping developers deeply understand plugin working principles.
Development Steps
Design routing strategy.
Developers design routing strategies and specific plugins themselves. This guide skips this step.
Develop routing plugins.
After completing the design of the routing strategy, developers need to split the processing flow into several EPP-compliant plugins and implement them. The main plugin types are Filter, Scorer, Picker, PreRequest, etc.
The following uses a Filter plugin that filters Pods as an example to demonstrate the EPP plugin development process:

2.1. Create Filter code file.
Hermes-router stores EPP plugins by category. It is recommended to create a new Filter plugin under pkg/plugins/filter.

2.2. Define Filter plugin struct.
```go
type MyFilter struct {
	typedName plugins.TypedName
	// ... other member variables
}
```
2.3. Implement struct constructor.
```go
func NewMyFilter(...) {
	// ... initialize member variables
	return &MyFilter{...}
}
```
2.4. Implement factory function.
```go
func ByMyFilterFactory(...) {
	// ... initialize Filter member variables based on parameters passed by GIE
	return NewMyFilter(...)
}
```
2.5. Implement the Filter() method in the Filter interface.
```go
func (m *MyFilter) Filter(_ context.Context, _ *types.CycleState, _ *types.LLMRequest, pods []types.Pod) []types.Pod {
	// ... filtering logic; return the Pods that remain candidates
	return pods // placeholder: keeps every Pod
}
```
2.6. Register the new Filter in pkg/plugins/register.go.
```go
func RegisterAllPlugins() {
	plugins.Register(filter.MyFilter, filter.ByMyFilterFactory)
	// ... other plugins
}
```
At this point, development of the Filter plugin is complete. Other types of EPP plugins can be developed following the same process.
Define routing strategy.
To apply the routing strategy, configure the following fields in the YAML file whose resource type is InferencePool:
- Declare the routing strategy name in the inferenceExtension.pluginsConfigFile field.
- Declare the specific routing strategy configuration in the inferenceExtension.pluginsCustomConfig field.
```yaml
# Configuration example for example-strategy routing strategy
inferencepool:
  inferenceExtension:
    pluginsConfigFile: "example_strategy.yaml"
    pluginsCustomConfig:
      example_strategy.yaml: |
        apiVersion: inference.networking.x-k8s.io/v1alpha1
        kind: EndpointPickerConfig
        plugins:
        # ... plugins
```
At this point, developers have completed the development of the routing strategy.
Debugging and Verification
Use kubectl port-forward for port forwarding.
Execute the following command to find the Gateway Service name.
```shell
kubectl get svc -n <NAMESPACE> -l gateway.networking.k8s.io/gateway-name=inference-gateway
```
Execute the following command to set up port forwarding.
```shell
kubectl port-forward -n <NAMESPACE> service/<gateway-service-name> 8000:80
```
Execute the following command to send a request.
```shell
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'
```
After development is complete, please refer to the Installation section in the User Guide for deployment and verification work.