AI Inference Hermes Routing

Feature Overview

Hermes-router is a Kubernetes (K8s) native AI inference intelligent routing solution that receives user inference requests and forwards them to appropriate inference service backends.

Architecturally, Hermes-router follows the K8s Gateway API Inference Extension (GIE) framework as a pluggable, extensible EndPointPicker (EPP) component, maximizing compatibility with the K8s ecosystem.
In terms of capabilities, Hermes-router provides multiple AI inference routing strategies including KVCache aware and PD bucket scheduling, helping users improve AI inference performance, cluster resource utilization, and service stability in various cloud-native scenarios.

Application Scenarios

Hermes-router is suitable for deploying and running AI inference services in Kubernetes cluster environments, specifically including the following scenarios.

Cloud-native AI inference services: Deploying large language model (LLM) inference services in K8s clusters, requiring intelligent routing capabilities to optimize request distribution and resource utilization.
Multi-instance inference backends: Intelligent routing of inference requests to multiple inference service instances (supporting aggregated or PD disaggregated architectures) to achieve load balancing and performance optimization.
High-concurrency inference scenarios: In business scenarios with mixed long and short requests and medium to high concurrency, intelligent scheduling based on request characteristics and instance load status is needed to improve inference throughput.
KVCache aware optimization scenarios: In scenarios with frequent repeated requests, routing optimization based on KVCache hit rate information is needed to improve inference performance and resource utilization.
Gateway integration scenarios: Existing K8s gateway infrastructure where AI inference routing capabilities need to be added without affecting the original gateway.

Capability Scope

Supports user deployment and usage in K8s clusters.
Supports user configuration of multiple routing strategies for OpenAI API-style AI inference requests.

Software Dependencies

As an EPP component of the GIE framework, Hermes-router needs to be used with open-source gateways that support Gateway API Inference Extension. The table below lists optional open-source gateways and their dependency component version requirements.

Table 1 Optional Open-source Gateways and Dependency Component Versions

Gateway	Gateway Version	Gateway API	GIE	Kubernetes
Istio	1.27+	1.4.0+	1.0+	1.29+
Nginx Gateway Fabric	2.2+	1.3.0+	1.0+	1.25+
Envoy AI Gateway	0.4+	1.4.0+	1.0+	1.32+
Kgateway	2.1+	1.3.0+	1.0+	1.29+

Note:
The table shows the minimum verified compatible versions. It is recommended to use the latest versions of open-source gateways to ensure the best experience.

Highlight Features

Here are the highlights independent of the GIE framework.

Feature design follows the GIE framework, naturally supports the K8s gateway ecosystem, and can be integrated with multiple open-source gateways. In clusters with existing gateways, it can be added as a pluggable capability, adding AI inference routing capabilities without affecting the original gateway.
Provides multiple innovative routing strategies, supporting aggregated, PD, and other inference backend architectures, helping users improve performance in various business scenarios.
- KVCache aware (aggregated/PD): Provides a KVCache aware routing strategy that allows users to customize scoring functions, improving inference performance in repeated request scenarios.
- PD bucket scheduling routing (PD): Provides a bucket scheduling strategy that allows users to customize parameters, improving inference throughput in scenarios with mixed long and short requests and medium to high concurrency.
Dynamic inference service discovery: Allows users to add/remove inference backends at runtime, enabling flexible adjustment of inference resource investment.

Implementation Principle

Figure 1 Hermes-router Architecture

Hermes-router is integrated into open-source gateways as an EPP component. The following uses a complete inference request as an example to explain the internal principle.

Users send an OpenAI API request /v1/chat/completions to the cluster gateway.
The gateway identifies the request as an inference request and forwards it to the EPP.
The EPP processes the request according to the user-configured routing strategy and selects the most suitable inference backend for processing the inference request.
The EPP returns the inference backend to the cluster gateway, and the cluster gateway sends the inference request to the target inference backend.
The inference backend completes the request and returns it to the gateway, which returns the inference result to the user.

cache-indexer: Required when using KVCache aware type strategies, obtains KVCache hit rate from this component through the /match_sort interface.
vLLM-ascend: The following specific dependencies exist for this feature.
- PD proxy service component proxy-server: Originally a sample component provided by vllm officially in the PD disaggregated architecture, serving as a hub to organize P/D instances to complete inference tasks. The openFuyao community has enhanced this component, and it now serves as the Leader instance of the PD Group to receive gateway inference requests and has the ability to dynamically discover inference service instances based on specified labels.
- NPU adaptation: When the environment is Ascend NPU, vLLM-ascend needs to be used as the inference engine to start the service.
- Inference metrics: Depends on vllm's /metrics interface to obtain inference service metrics, automatically obtained by the GIE architecture.

Installation

EPP Component Standalone Deployment

This section describes how to deploy Hermes-router as an EPP component in a cluster with existing Kubernetes Gateway infrastructure.

Delivery Specification

Hermes-router is deployed to the cluster as a standalone EPP component, requiring the cluster to already have an Envoy-based Gateway, Gateway API and Inference Extension CRDs, and inference backend services. After deployment, Hermes-router provides multiple AI inference intelligent routing strategies (KVCache aware, PD bucket scheduling, etc.), dynamically discovers and manages inference backends through InferencePool CR, and supports flexible configuration through HTTPRoute.

Prerequisites

Before starting the installation, please ensure the following conditions are met.

Environment Requirements
- Kubernetes cluster: v1.33.0 and above.
- Cluster administrator permissions: For installing CRDs and cluster-level resources.
- Helm tool: For deploying Hermes router and related components.
Deployment Component Requirements
Before deploying Hermes-router, the following components need to be installed in the cluster.
- Envoy-based gateway: The cluster has deployed a gateway supporting the ExtProc protocol (such as Istio, Envoy Gateway, etc.). Hermes-router interacts with the gateway through ExtProc (gRPC).
- Gateway API CRDs: Kubernetes Gateway API core resource definitions are installed.
- Inference Extension CRDs: Gateway API Inference Extension is installed, providing inference extension resource definitions such as InferencePool.
- Inference backend services: Inference engine services such as vLLM are deployed in the cluster.
Note:
If the above components are not yet installed in the cluster, please refer to the Install Supporting Components section to complete the installation.
Hardware Requirements
Hermes-router itself has no special hardware environment requirements. As a lightweight routing component, it can run on standard x86 or ARM architecture nodes.

Quick Install Hermes Router

Hermes-router supports multiple routing strategies. Users can obtain the chart package and preset routing strategy configuration files from the openFuyao GitCode repository according to business scenarios.

Pull the project from the repository.

bash

git clone https://gitcode.com/openFuyao/hermes-router.git

Install and deploy.

Taking the release name hermes-router as an example, please ensure the following operations are completed before executing the installation.

bash

cd hermes-router/charts/hermes-router
helm dependency build
helm install -n <NAMESPACE> hermes-router . \
  -f <routing_strategy_file_name>

Parameter descriptions are as follows.

<NAMESPACE>: Target namespace for deployment (such as ai-inference).
<routing_strategy_file_name>: Directly use the strategy files in Table 2; the repository provides examples in the examples/profiles/ directory (see profiles directory), which can be reused or customized as needed.

Table 2 Preset Routing Strategy List

Strategy File	Strategy Name	Applicable Scenario	Description
`epp-random-pd-bucket.yaml`	Random PD Bucket Routing	Simple load balancing	Randomly selects PD Bucket to achieve basic load balancing.
`epp-pd-bucket.yaml`	PD Bucket Scheduling Routing	Load-based routing	Scores PD Buckets based on their load status and selects the optimal instance, supporting TP heterogeneous PD disaggregated architecture.
`epp-pd-kv-cache-aware.yaml`	PD KVCache Aware Routing	KVCache optimization under PD architecture	Intelligently selects the optimal inference service by combining KVCache hit rate, XPU cache usage, and other information (suitable for PD architecture).
`epp-kv-cache-aware.yaml`	Aggregated Architecture KVCache Aware Routing	KVCache optimization under aggregated architecture	Intelligently selects the optimal inference service by combining KVCache hit rate, XPU cache usage, and waiting request count (suitable for aggregated architecture).

Verify deployment.

bash

# Check Pod running status
kubectl get pods -n <NAMESPACE> -l inferencepool=<INFERENCEPOOL_NAME>-epp

# Check InferencePool resources
kubectl get inferencepool -n <NAMESPACE>

# Check HTTPRoute resources
kubectl get httproute -n <NAMESPACE>

Note:
The EPP Pod label format is inferencepool=<INFERENCEPOOL_NAME>-epp, where <INFERENCEPOOL_NAME> is the name of the InferencePool resource, corresponding to the .Values.inferencepool.name configuration item in values.yaml. For example, if the InferencePool name is vllm-qwen-qwen3-8b, the EPP Pod label is app=vllm-qwen-qwen3-8b-epp.

Notice:
HTTPRoute and InferencePool CR need to be configured correctly when deploying Hermes router.
HTTPRoute needs to be associated with the Gateway resource in the cluster through parentRef.
InferencePool needs to configure the correct label selector (matchLabels) to discover and manage inference backend instances.
For detailed routing strategy configuration, please refer to the Configure Routing Strategy section.

InferNex Integrated Deployment

This section describes how to deploy Hermes-router through Infernex integration.

Delivery Specification

InferNex is a complete AI inference service integrated deployment package that deploys gateway, Hermes-router, HTTPRoute/InferencePool and other K8s resources, and inference backend services with one click. It provides an end-to-end AI inference solution, integrating gateway, intelligent routing, and inference services, ready to use out of the box.

When there is no gateway or inference backend in the environment and out-of-the-box deployment is needed, InferNex can be used directly for integrated deployment. Refer to Installation and Configuration Guide.

Prerequisites

Kubernetes v1.33.0 and above.
Kubernetes Gateway API CRDs: Provides core resource definitions for Gateway API.
Gateway API Inference Extension CRDs: Provides inference extension resource definitions such as InferencePool.
At least one inference chip per inference node.
At least 16GB memory and 4 CPU cores per inference node.
Online installation requires access to the image repository: oci://cr.openfuyao.cn.
Users have permissions to create RBAC resources.

Quick Install InferNex

InferNex has two independent deployment methods.

Obtain the project installation package from the openFuyao official image repository.
1. Pull the project installation package.
  bash
```
helm pull oci://cr.openfuyao.cn/charts/infernex --version xxx
```
  Where xxx needs to be replaced with the specific project installation package version, such as 0.21.1. The pulled installation package is in compressed package form.
2. Extract the installation package.
  bash
```
tar -xzvf infernex-xxx.tgz
```
  Where xxx needs to be replaced with the specific project installation package version, such as 0.21.1.
3. Install and deploy.
  Taking the release name infernex as an example, please ensure the following operations are completed before executing the installation.
  - The namespace istio-system has been created in the cluster (Istio Gateway resources must be deployed in this namespace).
  - The namespace specified by the global.namespace configuration item in values.yaml has been created in the cluster. The namespace for other components (such as inference-backend, hermes-router, cache-indexer, etc.) can be set through the global.namespace configuration item in values.yaml, with a default value of ai-inference.
  Execute the following command in the same directory as infernex.
  bash
```
helm install -n istio-system infernex ./infernex
```
Obtain from the openFuyao GitCode repository.
1. Pull the project from the repository.
  bash
```
git clone https://gitcode.com/openFuyao/InferNex.git
```
2. Install and deploy.
  Taking the release name infernex as an example, when installing with helm, the namespace istio-system needs to be specified because the open-source gateway istio is integrated. The namespace settings for other components can be configured in global.namespace in values.yaml. Execute the following command in the same directory as InferNex.
  bash
```
cd InferNex/charts/infernex
helm dependency build
helm install -n istio-system infernex .
```

Install Supporting Components

If supporting components such as gateway and inference backend are not yet deployed in your cluster, please complete the installation according to this section.

Open-source Gateway Installation

Hermes-router needs to be used with open-source gateways that support Kubernetes Gateway API and Gateway API Inference Extension. This document uses Istio as an example to introduce the installation and deployment process.

Install Istio and enable Gateway API Inference Extension support.
bash
```
istioctl install -y \
  --set tag=<ISTIO_TAG> \
  --set hub=gcr.io/istio-testing \
  --set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true
```
⚠️Notice:
ISTIO_TAG needs to use an Istio version that supports Inference Extension. Execute curl -s https://storage.googleapis.com/istio-build/dev/1.28-dev to get the latest version.
Verify installation.
Verify Istio installation.
shell
```
kubectl get pods -n istio-system
```

Deploy Inference Gateway.

After completing infrastructure installation, you can deploy Inference Gateway.

shell

helm upgrade --install inference-gateway examples/1_pd_bucket/charts/gateway \
  -n <NAMESPACE> --create-namespace \
  --set gateway.className=istio

Open-source gateway notes are as follows.

Istio version: An Istio development version that supports Inference Extension is required; stable versions may not support it.
Permission requirements: Installing CRDs requires cluster administrator permissions.
Other open-source gateways: In addition to Istio, users can choose open-source gateways that support Gateway API and Gateway API Inference Extension according to their needs. When configuring, just set gateway.className to the corresponding GatewayClass name.

Inference Engine Backend Installation

Hermes-router currently supports the vLLM inference engine, providing both aggregated and PD disaggregated architectures. Users can choose deployment as needed.

Install according to aggregated architecture, execute the following command.

shell

helm upgrade --install vllm examples/2_kv_aware/charts/vllm \
  -n ${NAMESPACE} --create-namespace \
  --set modelServer.name=$(MODEL) \
  --set modelServer.rootCachePath=/home/llm_cache

Install according to PD disaggregated architecture (Prefill-Decode Disaggregated), execute the following command.

shell

helm upgrade --install vllm-pd examples/1_pd_bucket/charts/vllm-pd \
    -n ${NAMESPACE} --create-namespace \
    --set modelServer.name=$(MODEL) \
    --set modelServer.rootCachePath=/home/llm_cache

In vLLM disaggregated architecture deployment, the proxy server needs to be able to parse the request headers added by EPP routing to achieve precise routing to P-end and D-end backend services. Specific implementation can refer to: proxy_server_example.

Install cache-indexer (Optional)

When Hermes-router uses the KVCache aware routing strategy, the cache-indexer component must be installed to obtain global KV Cache information. The installation steps are as follows.

Obtain the openFuyao/cache-indexer component helm chart deployment package.
shell
```
helm fetch oci://cr.openfuyao.cn/charts/cache-indexer --version 0.20.0
```

Configure cache-indexer to correctly provide global KVCache hit rate calculation service. Open the charts/cache-indexer/values.yaml file in the helm chart obtained in the previous step for configuration. The required parameters are described below.

yaml

app:
  serviceDiscovery: # This configuration is used to dynamically discover inference service instances and subscribe to kv cache messages
    labelSelector: "openfuyao.com/model=qwen-qwen3-8b" # Dynamically discover inference instance Pods carrying this label
    portName: "zmq-pub" # Subscribe to kv cache messages in the vllm port with this name;
    refreshInterval: 10 # Subscription interval (s)

service:
  name: cache-indexer-service # Name of the service resource at runtime, hermes-router requests cache-indexer through this name
  port: 8080 # External port

# ... Other configurations

Deploy cache-indexer.

shell

helm upgrade --install cache-indexer ./charts/cache-indexer \
    -n ${NAMESPACE} --create-namespace

Check deployment results.
Confirm that the Pod is running normally and that the logs show that inference service backend instances have been successfully discovered.

Using AI Inference Services

After deployment, you can send inference requests to the Inference Gateway in the following two ways.

LoadBalancer Access

If the cluster supports LoadBalancer, Istio Gateway will automatically create a LoadBalancer type Service.

shell

# Get External IP
EXTERNAL_IP=$(kubectl get svc -n istio-system istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

Send inference request.

shell

curl -X POST http://${EXTERNAL_IP}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'

NodePort Access

Execute the following command to get the node IP address and port.

shell

kubectl get svc -n istio-system istio-ingressgateway

Check the PORT(S) column, for example 80:30080/TCP, where 30080 is the NodePort.

If there is no ExternalIP, use InternalIP.

shell

NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
NODE_PORT=$(kubectl get svc -n istio-system istio-ingressgateway -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}')

Send request.

shell

curl -X POST http://${NODE_IP}:${NODE_PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
    "stream": false
  }'

Configure Routing Strategy

Metric Mapping

When using vLLM-Ascend in NPU environments, metric name mapping is required. By default, the kv-cache-usage-percentage-metric used by the GIE framework is the metric name for vLLM in GPU environments. To adapt to vLLM-Ascend, this metric needs to be mapped to vllm:kv_cache_usage_perc used by vLLM-Ascend through flags, configured as follows:

yaml

  inferenceExtension:
    flags:
      kv-cache-usage-percentage-metric: "vllm:kv_cache_usage_perc"

This configuration maps the metric kv-cache-usage-percentage-metric collected by the GIE framework to vllm:kv_cache_usage_perc, so that subsequent routing strategies can calculate based on the correct metric name.

aggregate kv cache aware

yaml

inferenceExtension:
  pluginsConfigFile: "epp-aggregate-kv-cache-aware.yaml"
  pluginsCustomConfig:
    epp-aggregate-kv-cache-aware.yaml: |
      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: EndpointPickerConfig
      plugins:
      - type: scorer-aggregate-kv-cache-aware    # Scoring plugin
        parameters:                              # Scoring parameters
          kvCacheHitNotRateWeight: 1.0
          xpuCacheUsageWeight: 1.0
          waitingRequestWeight: 1.0
          kvCacheManagerIP: cache-indexer-service
          kvCacheManagerPort: 8080
          kvCacheManagerPath: /match_sort
          kvCacheManagerTimeout: 5000000000
      - type: picker-min-random                  # Picker plugin
      schedulingProfiles:
      - name: default
        plugins:
        - pluginRef: scorer-aggregate-kv-cache-aware
        - pluginRef: picker-min-random

Plugin descriptions are as follows.

scorer-aggregate-kv-cache-aware: Scores inference service instances based on KVCache hit rate, XPU cache usage, and waiting request count.
picker-min-random: Randomly selects one from the instances with the lowest score to achieve load balancing.

Table 3 KVCache Aware Agg Parameter Descriptions

Parameter	Type	Description	Default Value
`kvCacheHitNotRateWeight`	float	KVCache miss rate weight.	1.0
`xpuCacheUsageWeight`	float	XPU cache usage weight.	1.0
`waitingRequestWeight`	float	Waiting request count weight.	1.0
`kvCacheManagerIP`	string	KVCache Indexer service IP address/name.	cache-indexer-service
`kvCacheManagerPort`	int	KVCache Indexer service port.	8080
`kvCacheManagerPath`	string	KVCache Indexer API path.	/match_sort
`kvCacheManagerTimeout`	int	KVCache Indexer request timeout (nanoseconds).	5000000000

Weight parameter descriptions are as follows.

Increase weight: This metric has a greater impact on scoring.
Decrease weight: The impact of this metric is reduced.
Example: If you care more about KVCache hit rate, you can increase kvCacheHitNotRateWeight.

pd kv cache aware

yaml

inferenceExtension:
  pluginsConfigFile: "epp-pd-kv-cache-aware.yaml"
  pluginsCustomConfig:
    epp-pd-kv-cache-aware.yaml: |
      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: EndpointPickerConfig
      plugins:
      - type: filter-by-pd-label              # PD label filter plugin
      - type: scorer-pd-kv-cache-aware         # PD KV Cache aware scoring plugin
      - type: picker-pd-kv-cache-aware         # PD KV Cache aware picker
      - type: pd-header-handler                # PD request header handler plugin
      schedulingProfiles:
      - name: default
        plugins:
        - pluginRef: filter-by-pd-label
        - pluginRef: scorer-pd-kv-cache-aware
        - pluginRef: picker-pd-kv-cache-aware
        - pluginRef: pd-header-handler

The plugins used by the pd kv cache aware strategy are as follows.

filter-by-pd-label: Filters inference service instances based on PD role (Prefill/Decode) and group ID.
scorer-pd-kv-cache-aware: Combines KVCache hit rate, XPU cache usage, and waiting request count to score Prefill and Decode Pods separately.
picker-pd-kv-cache-aware: Based on scoring results, intelligently selects the optimal Prefill or Decode Pod.
pd-header-handler: Adds request header information required for PD disaggregated architecture to specify routing backend.

The parameter descriptions for the filter-by-pd-label and scorer-pd-kv-cache-aware plugins are shown in Table 4 and Table 5.

Table 4 filter-by-pd-label Parameter Descriptions

Parameter	Type	Description	Default Value
`pdLabelName`	string	PD role label name.	openfuyao.com/pdRole
`pdGroupLabelName`	string	PD group label name.	openfuyao.com/pdGroupID
`prefillValue`	string	Prefill role label value.	prefill
`decodeValue`	string	Decode role label value.	decode
`leaderValue`	string	Leader role label value.	leader

Table 5 scorer-pd-kv-cache-aware Parameter Descriptions

Parameter	Type	Description	Default Value
`kvCacheHitNotRateWeight`	float	KVCache miss rate weight.	1.0
`xpuCacheUsageWeight`	float	XPU cache usage weight.	1.0
`prefillWaitingRequestWeight`	float	Prefill waiting request count weight.	1.0
`decodeWaitingRequestWeight`	float	Decode waiting request count weight.	1.0
`prefillPodScoreWeight`	float	Prefill Pod score weight.	1.0
`decodePodScoreWeight`	float	Decode Pod score weight.	1.0
`kvCacheManagerIP`	string	KVCache Indexer service IP address/name.	cache-indexer-service
`kvCacheManagerPort`	int	KVCache Indexer service port.	8080
`kvCacheManagerPath`	string	KVCache Indexer API path.	/match_sort
`kvCacheManagerTimeout`	int	KVCache Indexer request timeout (nanoseconds).	5000000000

PD label configuration descriptions are as follows.

pdLabelName and pdGroupLabelName: Must be consistent with Pod labels in vLLM PD deployment.
prefillValue, decodeValue: Must match the openfuyao.com/pdRole label value of Pods.
Ensure that the groupID of vLLM PD deployment matches the group label in the routing strategy.

Weight parameter adjustment descriptions are as follows.

Prefill-related weights: Adjust routing decisions for the Prefill phase.
- prefillWaitingRequestWeight: Impact of Prefill waiting request count.
- prefillPodScoreWeight: Impact of Prefill Pod score.
Decode-related weights: Adjust routing decisions for the Decode phase.
- decodeWaitingRequestWeight: Impact of Decode waiting request count.
- decodePodScoreWeight: Impact of Decode Pod score.
General weights: Affect both Prefill and Decode.
- kvCacheHitNotRateWeight: Impact of KV Cache miss rate.
- xpuCacheUsageWeight: Impact of XPU cache usage.

KVCache Indexer configuration is consistent with the above.

pd bucket

yaml

inferenceExtension:
  pluginsConfigFile: "epp-pd-bucket.yaml"
  pluginsCustomConfig:
    epp-pd-bucket.yaml: |
      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: EndpointPickerConfig
      plugins:
      - type: filter-by-pd-label              # PD label filter plugin
      - type: scorer-pd-bucket                 # PD Bucket scoring plugin
      - type: pd-header-handler                # PD request header handler plugin
      schedulingProfiles:
      - name: default
        plugins:
        - pluginRef: filter-by-pd-label
        - pluginRef: scorer-pd-bucket
          weight: 1
        - pluginRef: pd-header-handler

The plugins used by the pd bucket strategy are as follows.

filter-by-pd-label: Filters inference service instances based on PD role (Prefill/Decode) and group ID.
scorer-pd-bucket: Scores based on request length and Pod load status, routing requests to the lightest-loaded Bucket.
pd-header-handler: Handles request header information required for PD disaggregated architecture.

The parameter descriptions for the filter-by-pd-label and scorer-pd-bucket plugins are shown in Table 4 and Table 6.

Table 6 scorer-pd-bucket Parameter Descriptions

Parameter	Type	Description	Default Value
`alpha`	float	Load scoring coefficient.	1.0
`beta`	float	Request length scoring coefficient.	2.0
`decayFactor`	float	Load decay factor (0-1).	0.99
`bucketSeperateLength`	int	Bucket separation length threshold.	200

PD label configuration descriptions are as follows.

pdLabelName and pdGroupLabelName: Must be consistent with Pod labels in vLLM PD deployment.
prefillValue, decodeValue: Must match the openfuyao.com/pdRole label value of Pods.
Ensure that the groupID of vLLM PD deployment matches the group label in the routing strategy.

Scoring algorithm parameter descriptions are as follows.

alpha: Controls the weight of Pod current load in scoring, the larger the value, the greater the load impact.
beta: Controls the weight of request length in scoring, the larger the value, the greater the request length impact.
decayFactor: Load decay factor used to smooth load changes, the closer to 1, the slower the decay.
bucketSeperateLength: Request length threshold used to distinguish long requests and short requests, routing requests to different Buckets.

Scoring weight description is as follows.

weight: 1: Weight of the scoring plugin in the scheduling configuration, can be adjusted as needed.

Working Principle

Request classification: Compares request length with bucketSeperateLength to classify requests into long and short requests.
Load scoring: Calculates load score by combining Pod's current load and load decay factor.
Comprehensive scoring: Based on alpha and beta weights, combines load score and request length score.
Routing selection: Selects the Pod with the lowest score (lightest load) for routing.

Other Routing Strategies

Next, we will introduce other commonly used routing strategies and their configuration methods.

aggregate random

yaml

inferenceExtension:
  pluginsConfigFile: "epp-aggregate-random.yaml"
  pluginsCustomConfig:
    epp-aggregate-kv-cache-aware.yaml: |
      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: EndpointPickerConfig
      plugins:
      - type: picker-min-random                  # Use only random picker
      schedulingProfiles:
      - name: default
        plugins:
        - pluginRef: picker-min-random

The plugin used by the aggregate random strategy is picker-min-random.

random pd

yaml

inferenceExtension:
  pluginsConfigFile: "epp-random-pd.yaml"
  pluginsCustomConfig:
    epp-random-pd-bucket.yaml: |
      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: EndpointPickerConfig
      plugins:
      - type: filter-by-pd-label              # PD label filter plugin
      - type: picker-random-pd-bucket         # Random PD Bucket picker
      - type: pd-header-handler                # PD request header handler plugin
      schedulingProfiles:
      - name: default
        plugins:
        - pluginRef: filter-by-pd-label
        - pluginRef: picker-random-pd-bucket
        - pluginRef: pd-header-handler

The plugins used by the random pd strategy are as follows.

filter-by-pd-label: Filters inference service instances based on PD role (Prefill/Decode) and group ID.
picker-random-pd: Randomly selects one from qualified Pods to achieve basic load balancing.
pd-header-handler: Handles request header information required for PD disaggregated architecture.

Configure Disaster Recovery Capabilities

Prerequisites

An open-source gateway supporting GIE has been deployed in the K8s environment.

Background Information

Supported disaster recovery capabilities include automatic traffic switching, fault recovery, and request retry, aiming to ensure lossless or low-loss switching of request traffic when inference backends fail or restart, and automatic retry of inference requests when they fail due to various exceptions according to predetermined rules. The disaster recovery capability architecture is shown in the figure below.

Figure 2 Disaster Recovery Capability Architecture

Automatic Traffic Switching

The automatic traffic switching process is as follows.

Fault determination: The monitoring system detects business exceptions in backend services (such as long-term service unresponsiveness).
Trigger exit: Triggers the graceful exit process of the fault handling service.
Active offline: Deletes the Pod of the faulty backend service (or triggers automatic Pod termination).
Traffic switching:
- New traffic: After Pod deletion, the K8s Endpoint Controller removes the IP address from the Service/InferencePool list, and new traffic is automatically routed to other nodes.
- In-flight traffic: For requests being sent to the faulty Pod, requests will fail due to Pod termination or network unreachability. At this time, the gateway proxy captures 5xx errors or connection failures, triggering automatic retry, and forwards the request to other healthy backend services.

Fault Recovery

The fault recovery process is as follows.

Restart service: Pull up new inference backend Pods by K8s cluster or manually by users.
Service discovery warm-up: EPP discovers and waits for Pods to be ready, and verifies service availability by sending inference requests.
Online to receive traffic: After verification passes, EPP adds new Pods to the available service backend list.

Request Retry

The request retry mechanism is used to handle timeout or abnormal inference requests, ensuring system reliability and fault tolerance. In the GIE architecture, the retry mechanism needs to be configured on the gateway data plane. When inference requests are forwarded through the gateway, the retry logic is directly executed by the gateway's Envoy proxy. The specific request retry logic is shown in the figure below.

Figure 3 Disaster Recovery Capability Request Retry Process

Usage Restrictions

The current disaster recovery capabilities have been validated on the open-source gateway Istio, and the configurations in this subsection are applicable to Istio.
Due to Envoy data plane runtime logic limitations, when the gateway triggers retry requests, the gateway will not call EPP again but will select from the inference backend Pods in the inferencepool resource pool.
Since the current retry mechanism requires inference backends to be at the Pod resource granularity, it does not support the inferencepool resource pool containing Prefill, Decode, and other Pods that do not provide complete inference capabilities. Other disaster recovery capabilities are normally supported. The routing strategies currently not supported by the retry mechanism include: pd kv cache aware, pd bucket, and random pd.

Operation Steps

Automatic traffic switching and fault recovery capabilities are directly supported as basic capabilities of Hermes-router. Users only need to pay attention to request retry related configurations.

Enable Disaster Recovery in Standalone Deployment

When deploying Hermes-router standalone, other components required for inference services (open-source gateway, inference backend, etc.) are deployed separately by users. Users need to add the following configuration in charts/hermes-router/values.yaml to enable disaster recovery.

yaml

provider:
    istio:
      destinationRule:
        trafficPolicy:
          tls:
            mode: SIMPLE         
            insecureSkipVerify: true
      retryConfig:
        enabled: false
        retryOn: "connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes,5xx,reset"
        numRetries: 3

Parameter descriptions in the above configuration are shown in the table below.

Table 7 Disaster Recovery Capability Request Retry Parameters

Parameter	Description
`enabled`	Whether to enable request retry capability, optional: `true`/`false`.
`retryOn`	List of error types that trigger retry, typical optional values include: `connect-failure`, `refused-stream`, `unavailable`, `cancelled`, `retriable-status-codes`, `5xx`, `reset`, etc., can be combined as needed.
`numRetries`	Maximum retry count allowed for a single request.
`mode`	TLS mode when Istio communicates with backends, optional: `DISABLE` (TLS not enabled), `SIMPLE` (one-way TLS), `MUTUAL`/`ISTIO_MUTUAL` (mutual TLS, depends on certificates or identity provided by Istio).
`insecureSkipVerify`	Whether to skip verification of backend service certificates, optional: `true`/`false`; `true` is only recommended for testing/validation environments, production environments are recommended to set to `false`.

When deploying Hermes-router standalone, it is recommended to configure health probes for inference backend Pods to enhance disaster recovery effectiveness.

Enable Disaster Recovery in InferNex Deployment

When deploying through InferNex, disaster recovery configuration is preset in the helm chart. Users only need to set provider.istio.retryConfig.enabled to true in charts/infernex/values.yaml to enable request retry capability. Retry strategy parameters (such as retryOn, numRetries) can be adjusted according to business needs, configured in the same way as standalone deployment.

Note:
Inference backends in InferNex have health probes configured by default and do not require additional configuration.

View source on GitCode

AI Inference Hermes Routing ​

Feature Overview ​

Application Scenarios ​

Capability Scope ​

Software Dependencies ​

Highlight Features ​

Implementation Principle ​

Relationship with Related Features ​

Installation ​

EPP Component Standalone Deployment ​

Delivery Specification ​

Prerequisites ​

Quick Install Hermes Router ​

InferNex Integrated Deployment ​

Delivery Specification ​

Prerequisites ​

Quick Install InferNex ​

Install Supporting Components ​

Open-source Gateway Installation ​

Inference Engine Backend Installation ​

Install cache-indexer (Optional) ​

Using AI Inference Services ​

Configure Routing Strategy ​

Metric Mapping ​

aggregate kv cache aware ​

pd kv cache aware ​

pd bucket ​

Other Routing Strategies ​

aggregate random ​

random pd ​

Configure Disaster Recovery Capabilities ​

Prerequisites ​

Background Information ​

Automatic Traffic Switching ​

Fault Recovery ​

Request Retry ​

Usage Restrictions ​

Operation Steps ​

Enable Disaster Recovery in Standalone Deployment ​

Enable Disaster Recovery in InferNex Deployment ​

AI Inference Hermes Routing

Feature Overview

Application Scenarios

Capability Scope

Software Dependencies

Highlight Features

Implementation Principle

Relationship with Related Features

Installation

EPP Component Standalone Deployment

Delivery Specification

Prerequisites

Quick Install Hermes Router

InferNex Integrated Deployment

Delivery Specification

Prerequisites

Quick Install InferNex

Install Supporting Components

Open-source Gateway Installation

Inference Engine Backend Installation

Install cache-indexer (Optional)

Using AI Inference Services

Configure Routing Strategy

Metric Mapping

aggregate kv cache aware

pd kv cache aware

pd bucket

Other Routing Strategies

aggregate random

random pd

Configure Disaster Recovery Capabilities

Prerequisites

Background Information

Automatic Traffic Switching

Fault Recovery

Request Retry

Usage Restrictions

Operation Steps

Enable Disaster Recovery in Standalone Deployment

Enable Disaster Recovery in InferNex Deployment