Eagle Eye for AI Inference

Feature Overview

Eagle Eye is an observability system for AI inference scenarios, implementing full-link metric collection from AI gateways, inference engines, Mooncake to infrastructure (Ray, K8s, hardware), near real-time transmission and intelligent diagnosis. The system integrates Prometheus's periodic metric collection with the low-latency push mechanism of distributed message queue systems, supporting both trend analysis for scaling decisions and meeting the requirements of modules with high timeliness needs (such as intelligent routing) for second-level data updates. Through an independent hardware health diagnosis module, continuous monitoring and anomaly identification of underlying metrics such as NPU/GPU, temperature, power consumption, and error codes are achieved, building a closed-loop monitoring capability of "collection—transmission—diagnosis—evaluation" to provide solid data support for AI inference system stability, performance optimization, and resource scheduling.

Application Scenarios

System Resource Health Monitoring: Real-time monitoring of infrastructure (Ray, K8s, hardware) to ensure system resource health and stability, promptly discovering and resolving resource bottlenecks to ensure system efficiency.
Inference Process Performance Optimization: Real-time monitoring of performance metrics (such as latency, throughput) and resource usage at each stage of the inference process (such as prefill, decode, etc.), identifying and analyzing performance bottlenecks, optimizing model execution efficiency, and improving response speed and computational efficiency of inference tasks.
Hardware Fault Diagnosis and Repair: View abnormality analysis reports provided by the hardware diagnosis module, which include fault pattern identification and handling recommendations to help quickly locate and resolve hardware faults. The system can monitor NPU/GPU, temperature, power consumption, and other hardware states in real time, generate detailed fault analysis reports, provide specific fault causes and repair solutions, ensuring hardware stability and reliability.
Automatic Scaling Decisions: Obtain SLA-related metrics (such as throughput rate, latency, etc.) and use this data as the basis for automatic scaling decisions, ensuring that inference services dynamically expand or shrink according to load and performance requirements to achieve elastic scaling goals.
Intelligent Routing Decisions: Implement near real-time data updates through distributed message queue systems, enabling intelligent routing to make decisions quickly based on the latest data, thereby optimizing response speed in the AI inference process.
Weight Distribution Acceleration Decisions: Implement near real-time data transmission through distributed message queue systems to obtain network dynamic performance metrics in real time (such as actual transmission rate of node RDMA network cards, remaining available bandwidth, etc.), and combine with static metrics attached to nodes in label form by NPU Feature Discovery (such as theoretical transmission speed of node RDMA network cards, PCIe bandwidth of network cards, etc.). Based on these metrics, the module dynamically evaluates node performance and ultimately selects the node with the best performance for task allocation, ensuring optimal resource utilization and maximizing system stability and processing efficiency.

Capability Scope

Multi-layer Metric Coverage: Covers AI gateways (such as performance, resource consumption, security and compliance audits, governance policy execution tracking), inference engines (API Server, model input/output, inference process, inference engine status), Mooncake (Mooncake master, transfer engine, Mooncake client) and infrastructure (Ray, K8S, hardware), achieving full-link observation.
Near Real-time Metric Transmission: For modules with high timeliness requirements, second-level metric push is implemented through distributed message queue systems to ensure metrics can be perceived promptly and influence decisions.
Scaling Decision Support: Synchronize collected system and runtime metrics to Prometheus for periodic calculation and trend evaluation.
Hardware Health Check and Diagnosis: Build an independent hardware health diagnosis module that periodically collects underlying metrics such as NPU/GPU temperature, power consumption, and error codes, and reports them in real time through distributed message queue systems. The diagnosis module subscribes to and analyzes collected data, combines device model, driver, and firmware information, identifies typical fault patterns based on threshold rules and abnormal metric analysis, and outputs diagnostic conclusions and handling recommendations, achieving a closed loop from data collection to health assessment.

Implementation Principle

Logical View

Figure 1 Layer 0 Logical View

The monitoring system is divided into backend layer and component layer according to business hierarchy.

Backend Layer

Hardware Health Monitor: The hardware health detection module operates as an independent collection component, actively executing metric collection and reporting in a periodic task manner. During operation, the module calls underlying interfaces (DCMI, NVML) or parses system logs (dmesg) according to a fixed collection cycle to obtain device running status and health information. Collection results are published in real time to the diagnosis module through distributed message queue systems, achieving decoupling of collection and diagnosis.
Hardware Diagnosis: The diagnosis module subscribes to metric data published by the collection module through distributed message queue systems, combines device model, driver, and firmware information to analyze hardware health status in real time. The module supports threshold judgment and anomaly detection, identifies typical fault patterns and outputs diagnostic conclusions and handling recommendations, achieving a closed loop from data collection to health assessment.

Component Layer

The component layer provides underlying metric collection, transmission, and display capabilities, covering the following key modules: metric collection (Exporter), high-performance distributed message queue systems, metric storage (Prometheus), and display (Grafana).

Data Flow Diagram

NATS

Figure 2 Data Flow Diagram-NATS

ZMQ

Figure 3 Data Flow Diagram-ZMQ

Complete Data List

Table 1 Complete Data List

Observable Dimension	Observable Category	Observable Item	Observable Sub-item	Specific Metric	Static/Dynamic	Metric Source	Real-time Requirement	Metric Transmission	Currently Supported	Remarks
AI Gateway	Performance	-	-	QPS	Dynamic	EPP	Low	Prometheus	-	-
				Request success rate						-
				Overall latency for non-streaming responses						-
				Time to first token (TTFT) for streaming responses						-
	Resource Consumption	-	-	Token consumption/s	Dynamic	EPP	Low	Prometheus		Large model services charge by token
				Token usage statistics by model		EPP
				Token usage statistics by user		-
	Security and Compliance Audit	-	-	Content security interception logs	Dynamic	-	Low	Prometheus		-
				Risk type statistics		-				-
				Risk user statistics		-				-
	Governance Policy Execution Tracking	-	-	Rate limiting statistics		-	Low	Prometheus		-
				Cache hit status		EPP				-
				Fallback success rate		-				-
Inference Engine	API Server	-	-	Path	Dynamic	-	Low	Prometheus	-	-
				Status code						-
				Client						-
				Duration						-
	Model Input/Output	-	-	Prompt	Static	-	Low	Prometheus	✅	Exposed as OpenTelemetry Trace attributes, needs to be enabled when deploying vLLM. vllm serve your-model --otlp-traces-endpoint="http://your-collector:4317" --collect-detailed-traces=all
				Response mode		-
				Model name		vLLM
				Max tokens		vLLM
				Temperature		vLLM
				top-k		-
				top-p		vLLM
				N		vLLM
	Inference Process	-	-	E2E time	Dynamic	vLLM	Low	Prometheus		-
				TTFT						-
				Prefill time						-
				Decode time						-
				Request waiting time in scheduler						-
				Token interval time						-
	Inference Engine Status	-	-	Running request count	Dynamic	vLLM	Low	Prometheus		-
				Waiting request count						-
				Preempted request count						-
				KV cache usage						-
				Total prompt token count						-
				Total generation token count						-
				Successfully processed request count						-
Mooncake	Mooncake master	-	-	-	Dynamic	Mooncake	Low	Prometheus	-	Mooncake master exposes metrics directly to prometheus through /metrics, while transfer engine and Mooncake client metrics are recorded in logs.
	transfer engine	-	-	-	Dynamic		Low	Prometheus
	Mooncake client	-	-	-	Dynamic		Low	Prometheus
Infrastructure	ray	-	-	-	Dynamic	ray	Low	Prometheus	-	-
	kubernetes	Cluster health	-	-	Dynamic	kube-state-metrics/node-exporter	Low	Prometheus	✅	-
		Resource usage	-	-						-
		Workload status	-	-						-
		Scheduling and Elasticity	Scheduling	Schedule queue		kube-scheduler		Prometheus	✅	-
				Scheduling latency						-
				Scheduling success/failure rate						-
				Retry count						-
			Elasticity	Desired replica count		kube-state-metrics		Prometheus	✅	-
				Current replica count						-
				Ready replica count						-
				Scaling latency		-			-	-
				Image pull time		kubelet			✅	-
				Weight download time		-			-	-
				Scaling trigger count		-			-	-
	Hardware	Compute Resources	-	machine_npu_nums	Dynamic	npu-exporter	Low	Prometheus	✅	-
				npu_chip_info_utilization						-
				npu_chip_info_overall_utilization						-
				npu_chip_info_vector_utilization						-
				npu_chip_info_temperature						-
				npu_chip_info_power						-
				npu_chip_info_voltage						-
				npu_chip_info_aicore_current_freq						-
				npu_chip_info_process_info_num						-
				npu_chip_info_process_info						-
				container_npu_utilization						-
				container_npu_total_memory						-
				container_npu_used_memory						-
		Memory and Video Memory	hbm (910A1/910A2/910A3/910A5)	npu_chip_info_hbm_used_memory						-
				npu_chip_info_hbm_total_memory						-
				npu_chip_info_hbm_utilization						-
				npu_chip_info_hbm_temperature						-
				npu_chip_info_hbm_bandwidth_utilization						-
				npu_chip_info_hbm_ecc_enable_flag						-
				npu_chip_info_hbm_ecc_single_bit_error_cnt						-
				npu_chip_info_hbm_ecc_double_bit_error_cnt						-
				npu_chip_info_hbm_ecc_total_single_bit_error_cnt						-
				npu_chip_info_hbm_ecc_total_double_bit_error_cnt						-
				npu_chip_info_hbm_ecc_single_bit_isolated_pages_cnt						-
				npu_chip_info_hbm_ecc_double_bit_isolated_pages_cnt						-
			ddr (except 910A2/910A3/910A5)	npu_chip_info_total_memory						-
			ddr (except 910A2/910A3/910A5)	npu_chip_info_used_memory						-
		Interconnect and IO	network (except 310/310B/310P)	npu_chip_info_bandwidth_tx						-
				npu_chip_info_bandwidth_rx						-
				npu_chip_link_speed						-
				npu_chip_link_up_num						-
			pcie (910A2)	npu_chip_info_pcie_rx_p_bw						-
				npu_chip_info_pcie_rx_np_bw						-
				npu_chip_info_pcie_rx_cpl_bw						-
				npu_chip_info_pcie_tx_p_bw						-
				npu_chip_info_pcie_tx_np_bw						-
				npu_chip_info_pcie_tx_cpl_bw						-
			hccs (910A2/910A3)	npu_chip_info_hccs_statistic_info_tx_cnt_index						-
				npu_chip_info_hccs_statistic_info_rx_cnt_index						-
				npu_chip_info_hccs_statistic_info_crc_err_cnt_index						-
				npu_chip_info_hccs_bandwidth_info_tx_index						-
				npu_chip_info_hccs_bandwidth_info_rx_index						-
				npu_chip_info_hccs_bandwidth_info_profiling_time						-
				npu_chip_info_hccs_bandwidth_info_total_tx						-
				npu_chip_info_hccs_bandwidth_info_total_rx						-
			roce (except 310/310B/310P)	npu_chip_mac_rx_pause_num						-
				npu_chip_mac_tx_pause_num						-
				npu_chip_mac_rx_pfc_pkt_num						-
				npu_chip_mac_tx_pfc_pkt_num						-
npu_chip_mac_rx_bad_pkt_num				-
npu_chip_mac_tx_bad_pkt_num				-
npu_chip_mac_tx_bad_oct_num				-
npu_chip_mac_rx_bad_oct_num				-
npu_chip_info_rx_fcs_num				-
npu_chip_info_rx_ecn_num				-
npu_chip_roce_rx_all_pkt_num				-
npu_chip_roce_tx_all_pkt_num				-
npu_chip_roce_rx_err_pkt_num				-
npu_chip_roce_tx_err_pkt_num				-
npu_chip_roce_rx_cnp_pkt_num				-
npu_chip_roce_tx_cnp_pkt_num				-
npu_chip_roce_new_pkt_rty_num				-
npu_chip_roce_out_of_order_num				-
npu_chip_roce_qp_status_err_num				-
npu_chip_roce_unexpected_ack_num	-
npu_chip_roce_verification_err_num	-
Network Hardware Performance	-	Theoretical transmission speed of node RDMA network card	Static	npu-feature-discovery	Low	/	✅	-
		Theoretical maximum network download speed of node
		Total bandwidth of NPU card-side RoCE network card
		PCIe bandwidth of network card
	-	Actual transmission speed of node RDMA network card	Dynamic	network-performace-exporter	High	Prometheus NATS/ZMQ	-	-
		Remaining available bandwidth of node RDMA network card					-	-
		How much traffic node transmitted through RDMA network card					-	-
		Current actual maximum network download speed of node					-	-
		NPU card-side RoCE network card transmission rate					-	-
		NPU card-side RoCE network card remaining bandwidth					-	-
		Minimum bottleneck bandwidth between nodes across racks and multiple switches					-	-
Hardware Health	Black box error code	-	Dynamic	eagle-eye-hardware-monitor	High	NATS/ZMQ	✅	-
	Health management fault code	-						-
	hbm (910A1/910A2/910A3/910A5)	npu_chip_info_hbm_temperature						-
		npu_chip_info_hbm_total_memory						-
		npu_chip_info_hbm_used_memory						-
		npu_chip_info_hbm_memory_utilization						-
	ddr (except 910A2/910A3/910A5)	npu_chip_info_memory_utilization						-
	network (except 310/310B/310P)	npu_chip_info_link_status						-
		npu_chip_info_bandwidth_rx						-
		npu_chip_info_bandwidth_tx						-
	Hardware Status	npu_chip_info_power						NPU overload frequency reduction
		npu_chip_info_aicore_freq
		npu_chip_info_aicore_current_freq
		npu_chip_info_temperature
		npu_chip_info_network_status						-
		npu_chip_info_health_status						-

Depends on inference engines (such as vLLM).

Installation

Prerequisites

Hardware Requirements

Eagle Eye itself has no special hardware environment requirements.

Software Requirements

Kubernetes v1.33.0 and above.

Installation Steps

openFuyao Platform

Log in to the openFuyao management dashboard.
Select "Application Market > Application List" in the left navigation bar to jump to the "Application List" interface.
Check "Artificial Intelligence/Machine Learning" in the left "Scenario" section and find the "eagle-eye" card. Or enter "eagle-eye" in the search box.
Click "Deploy" to enter the "Deploy" interface.
Enter the application name, select the installation version and namespace. You can select an existing namespace or create a new namespace. For creating namespaces, refer to Namespaces.
Configure routing in "values.yaml" in parameter configuration.
Click "Deploy" to complete installation.

Standalone Deployment

In addition to openFuyao platform installation and deployment, this feature also provides standalone deployment functionality, with the following two methods:

Obtain the project installation package from the openFuyao official image repository.

Execute the following command to pull the project installation package.
bash
```
helm pull oci://cr.openfuyao.cn/charts/eagle-eye --version xxx
```
Where xxx needs to be replaced with the specific project installation package version, such as 0.0.0-latest. The pulled installation package is in compressed package form.
Execute the following command to extract the installation package.
bash
```
tar -zxvf eagle-eye
```
Install and deploy.
Execute the following command in the same directory as eagle-eye.
bash
```
helm install eagle-eye -n xxxxxx .
```

Obtain from the openFuyao GitCode repository.

Execute the following command to pull the project from the repository.
bash
```
git clone https://gitcode.com/openFuyao/eagle-eye.git
```
Install and deploy. Execute the following command in the same directory as eagle-eye.
bash
```
helm install eagle-eye -n xxxxxx .
```

High Real-time Metric Collection and Reporting for Intelligent Routing

NATS

To meet the high timeliness requirements of modules such as intelligent routing, the system uses NATS for second-level push of key performance metrics. This enables metrics such as waiting request count and NPU/GPU KVCache utilization to be perceived in real time, thereby dynamically adjusting decisions.

Currently only vLLM deployment is supported.

Prerequisites

Kubernetes cluster is deployed and accessible
kubectl command line tool is installed and cluster access permissions are configured
Ascend driver and related dependencies are installed on nodes
NATS is deployed and running normally

Deployment Steps

The NATS publish/subscribe model is used as the metric transmission method. CustomStatLogger is a custom statistics data logger implemented in vLLM by inheriting the StatLoggerBase abstract class. As a publisher, CustomStatLogger actively generates structured operation metric messages after each decoding batch ends and publishes them to the specified NATS topic; Router, as a subscriber, only needs to register a listener on this topic to receive and trigger subsequent processing logic at the moment the message is published.

To avoid the loss of the first batch of messages, the subscriber end (Router) needs to complete connection and subscription before the publisher end (CustomStatLogger).

Router (Subscriber).

Router, as a subscriber, is mainly responsible for receiving published operation metric messages and triggering corresponding processing logic. Refer to the following code example to implement Router and deploy it to the Kubernetes cluster.

python

import logging
import json
from nats.aio.client import Client as NATS_Client

logger = logging.getLogger()

class RouterSubscriber:
    def __init__(self, server_address: str, topic: str):
        """
        Initialize RouterSubscriber instance
        :param server_address: NATS server address
        :param topic: Topic to subscribe to
        """
        self.server_address = server_address
        self.topic = topic
        self.client = NATS_Client()

    async def connect(self):
        """
        Establish connection with NATS server
        """
        try:
            # Establish NATS long connection
            await self.client.connect(self.server_address)
            logger.info("Connected to NATS server at %s", self.server_address)
        except Exception as e:
            logger.error("Failed to connect to NATS server: %s", e)
            raise e

    async def subscribe(self, topic: str, message_handler):
        """
        Subscribe to specified topic and pass received messages to business logic processing
        :param message_handler: Business logic processing function for handling received messages
        """
        try:
            async def on_message(msg):
                """
                Message callback function to handle subscribed messages
                :param msg: NATS message
                """
                logger.info("Received message on topic %s", topic)
                await message_handler(msg)

            # Ensure NATS connection
            if not self.client.is_connected:
                await self.connect()

            # Register subscription topic and bind message callback function
            await self.client.subscribe(topic, cb=on_message)
            logger.info("Successfully subscribed to NATS topic %s", topic)

        except Exception as e:
            logger.error("Failed to subscribe to NATS topic %s: %s", topic, e)
            raise e
            
    async def message_handler(self, msg) -> None:
        """
        NATS original callback, responsible for:
        1. Decoding data from msg;
        2. Calling parse_message for JSON deserialization;
        3. Passing parsed structured data to subsequent business logic processing.
        """
        try:
            data = msg.data.decode("utf-8")
        except Exception as exc:
            logger.error("Failed to decode NATS message: %s", exc)
            return

        data_dict = self.parse_message(data)
        if not data_dict:
            # Return directly if parsing fails to avoid subsequent logic exceptions
            return

        # TODO: Process specific business logic here, such as making routing decisions based on metrics
        ...

    def parse_message(self, data: str) -> Dict[str, Any]:
        """
        Deserialize NATS message body from JSON string to dictionary.
        Log and return empty dictionary if parsing fails.
        """
        try:
            return json.loads(data)
        except Exception as exc:
            logger.error("Failed to parse message: %s, raw: %r", exc, data)
            return {}

Customstatlogger (Publisher).

CustomStatLogger is the publisher, responsible for generating and publishing real-time operation metric data. After each decoding batch ends, CustomStatLogger publishes structured metric messages to the specified NATS topic (this topic should match the subscriber's topic, currently topic eagle_eye.routing_metrics) for Router subscription.

2.1 Package source code.

Execute the following command in the project root directory to package the source code into a compressed file.

bash

tar -czf source_code.tar.gz src/ requirements.txt

2.2 Create ConfigMap.

Import the compressed package into ConfigMap as a binary file.

bash

# Check if namespace exists, create it if it doesn't exist
kubectl get namespace ai-inference || kubectl create namespace ai-inference

# Delete if ConfigMap with the same name already exists
kubectl delete configmap vllm-source-code -n ai-inference --ignore-not-found

# Create new ConfigMap
kubectl create configmap vllm-source-code \
  --from-file=source_code.tar.gz=source_code.tar.gz \
  -n ai-inference

2.3 Deploy service.

Apply deployment YAML file.

bash

kubectl apply -f ./docs/routing-metrics/vllm_eagle_eye.yaml

2.4 Check deployment status.

bash

# Check Pod running status
kubectl get pods -n ai-inference -l app=vllm-eagle-eye

# Check Pod detailed information
kubectl describe pod -n ai-inference -l app=vllm-eagle-eye

# Check Pod logs
kubectl logs -n ai-inference -l app=vllm-eagle-eye -f

Verification steps.

3.1 Method 1: Verify using temporary Pod.

3.1.1 Get service Pod IP address.

bash

kubectl get pod -n ai-inference -l app=vllm-eagle-eye -o wide

Record the IP address in the output (for example: 10.244.1.5).

3.1.2 Start temporary test Pod.

bash

kubectl run curl-test --image=curlimages/curl -n ai-inference -it --rm --restart=Never -- sh

3.1.3 Send test request in temporary Pod.

Execute in the temporary Pod's shell (replace <POD_IP> with actual IP address):

bash

curl -X POST http://<POD_IP>:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello AI"}'

3.2 Method 2: Verify using port-forward.

bash

# Port forwarding
kubectl port-forward -n ai-inference deployment/vllm-eagle-eye 8000:8000

# Send request in another terminal
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello AI"}'

Clean up resources.

bash

# Delete Deployment
kubectl delete -f ./docs/routing-metrics/vllm_eagle_eye.yaml

# Delete ConfigMap
kubectl delete configmap vllm-source-code -n ai-inference

# Delete temporary Pod
kubectl delete pod -n ai-inference curl-test

Common Issues

Pod startup failure.

bash

# Check Pod events
kubectl describe pod -n ai-inference -l app=vllm-eagle-eye

# Check init container logs
kubectl logs -n ai-inference <pod-name> -c code-extractor

Dependency installation failure.

If the cluster cannot access the internet, dependency packages need to be packed into the compressed package in advance.

bash

# Download dependencies locally
pip download -r requirements.txt -d packages/

# Include dependencies when packaging
tar -czf source_code.tar.gz src/ requirements.txt packages/

View source on GitCode

Eagle Eye for AI Inference ​

Feature Overview ​

Application Scenarios ​

Capability Scope ​

Implementation Principle ​

Logical View ​

Data Flow Diagram ​

NATS ​

ZMQ ​

Complete Data List ​

Relationship with Related Features ​

Installation ​

Prerequisites ​

Hardware Requirements ​

Software Requirements ​

Installation Steps ​

openFuyao Platform ​

Standalone Deployment ​

High Real-time Metric Collection and Reporting for Intelligent Routing ​

NATS ​

Prerequisites ​

Deployment Steps ​

Common Issues ​

Eagle Eye for AI Inference

Feature Overview

Application Scenarios

Capability Scope

Implementation Principle

Logical View

Data Flow Diagram

NATS

ZMQ

Complete Data List

Relationship with Related Features

Installation

Prerequisites

Hardware Requirements

Software Requirements

Installation Steps

openFuyao Platform

Standalone Deployment

High Real-time Metric Collection and Reporting for Intelligent Routing

NATS

Prerequisites

Deployment Steps

Common Issues