
Copyright © 2026, NVIDIA Corporation.

LLM Request Router Metrics


The LLM Request Router serves Prometheus metrics from llm-request-router:9090/metrics when llmRequestRouter.metrics.enabled is true.

The self-managed stack maps global.observability.metrics.enabled to this chart value. The request-router chart runs Stargate with --metrics-prefix=llm_request_router_, so deployed metric names use the llm_request_router_ prefix instead of the upstream default stargate_ prefix. The chart also sets the trace service name with --otel-service-name=llm-request-router.
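One way to confirm that a deployment exposes the llm_request_router_ prefix rather than the upstream stargate_ prefix is to inspect the names in a scraped payload. The following is a minimal sketch, not part of the chart: the function names are hypothetical, and the sample payload (label values, sample values, and the process_cpu_seconds_total line) is made up for illustration. In practice the text would come from an HTTP GET against llm-request-router:9090/metrics.

```python
import re

EXPECTED_PREFIX = "llm_request_router_"  # set by --metrics-prefix in the chart
UPSTREAM_PREFIX = "stargate_"            # upstream default prefix

# Matches the metric name at the start of a sample line in the Prometheus text format.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text):
    """Return the set of sample names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def split_by_prefix(names):
    """Partition names into (deployed prefix, upstream prefix, everything else)."""
    deployed = sorted(n for n in names if n.startswith(EXPECTED_PREFIX))
    upstream = sorted(n for n in names if n.startswith(UPSTREAM_PREFIX))
    other = sorted(
        n for n in names if not n.startswith((EXPECTED_PREFIX, UPSTREAM_PREFIX))
    )
    return deployed, upstream, other


# Hypothetical payload for illustration only; not captured from a real router.
SAMPLE = """\
# HELP llm_request_router_requests_total Total proxied requests.
# TYPE llm_request_router_requests_total counter
llm_request_router_requests_total{routing_key="rk-a",model="m-1",inference_server_id="s-1",status="200"} 42
llm_request_router_active_inference_servers{routing_key="rk-a",model="m-1"} 3
process_cpu_seconds_total 1.5
"""

deployed, upstream, other = split_by_prefix(metric_names(SAMPLE))
print("deployed prefix:", deployed)
print("upstream prefix:", upstream)  # a non-empty list would suggest the prefix flag is not applied
print("other:", other)
```

A non-empty upstream list would be a hint that the scraped process is not running with the chart's prefix flag.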

Label Boundaries

Use only bounded labels. Restrict the values of routing_key, model, inference_server_id, status, result, and reason to bounded service dimensions. Do not add request IDs, session IDs, function IDs, organization IDs, project IDs, raw URLs, raw prompts, authorization values, or other unbounded request fields as metric labels, because each distinct label value creates a separate time series.
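The label rule above can be checked mechanically against a scraped payload. This is a rough sketch under stated assumptions: audit_labels and the max_values_per_label threshold are hypothetical names chosen for illustration, the threshold of 50 is arbitrary, and the sample lines (including the offending request_id label) are invented. The le label is allowed because Prometheus histograms add it to bucket series.

```python
import re
from collections import defaultdict

# Labels named in this page as bounded service dimensions.
ALLOWED_LABELS = {
    "routing_key", "model", "inference_server_id", "status", "result", "reason",
}
# "le" is added by Prometheus histograms to bucket series.
BUILTIN_LABELS = {"le"}

_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s")
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def audit_labels(exposition_text, max_values_per_label=50):
    """Return (unexpected label uses, labels whose distinct value count exceeds the threshold)."""
    unexpected = set()
    values = defaultdict(set)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, label_block = match.group(1), match.group(2) or ""
        for label, value in _LABEL.findall(label_block):
            if label in BUILTIN_LABELS:
                continue
            if label not in ALLOWED_LABELS:
                unexpected.add((name, label))
            values[label].add(value)
    high_cardinality = {
        label: len(seen)
        for label, seen in values.items()
        if len(seen) > max_values_per_label
    }
    return sorted(unexpected), high_cardinality


# Hypothetical payload; the second line carries a request_id label that breaks the rule.
SAMPLE = """\
llm_request_router_requests_total{routing_key="rk-a",model="m-1",inference_server_id="s-1",status="200"} 7
llm_request_router_requests_total{routing_key="rk-a",model="m-1",inference_server_id="s-1",status="200",request_id="abc123"} 1
llm_request_router_proxy_duration_seconds_bucket{routing_key="rk-a",model="m-1",inference_server_id="s-1",le="0.5"} 5
"""

unexpected, high_cardinality = audit_labels(SAMPLE)
print(unexpected)
print(high_cardinality)
```

The allowlist catches new label names, while the distinct-value count is a coarse guard against an allowed label such as reason being filled with unbounded values.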

Metrics

| Metric name | Type | Source endpoint | Labels | Notes |
|---|---|---|---|---|
| llm_request_router_requests_total | Counter | llm-request-router:9090/metrics | routing_key, model, inference_server_id, status | Total proxied requests by selected backend and status. |
| llm_request_router_proxy_attempts_total | Counter | llm-request-router:9090/metrics | routing_key, model, inference_server_id, result | Upstream proxy attempts by selected backend and result. |
| llm_request_router_proxy_retries_total | Counter | llm-request-router:9090/metrics | routing_key, model, reason | Total proxy retries by retry reason. |
| llm_request_router_proxy_retry_exhausted_total | Counter | llm-request-router:9090/metrics | routing_key, model, reason | Total requests that exhausted retry options. |
| llm_request_router_quic_connection_evictions_total | Counter | llm-request-router:9090/metrics | inference_server_id, reason | Total QUIC pool evictions by backend and reason. |
| llm_request_router_quic_hot_path_reconnect_total | Counter | llm-request-router:9090/metrics | inference_server_id, result | Direct QUIC reconnect attempts from the proxy hot path. |
| llm_request_router_proxy_replay_buffer_bytes | Histogram | llm-request-router:9090/metrics | model | Proxied request replay buffer size in bytes. |
| llm_request_router_proxy_duration_seconds | Histogram | llm-request-router:9090/metrics | routing_key, model, inference_server_id | Time to first byte from upstream in seconds. |
| llm_request_router_routing_duration_seconds | Histogram | llm-request-router:9090/metrics | routing_key, model | Load-balancer decision time in seconds. |
| llm_request_router_active_inference_servers | Gauge | llm-request-router:9090/metrics | routing_key, model | Currently routable inference servers for a routing target. |
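Because the counters in this table only ever increase (apart from resets on restart), they are usually read as increases over an interval rather than as raw values. The sketch below shows one possible way to derive a request rate and retry ratios from two scrapes. The helper names, the 60-second interval, and all sample values are assumptions for illustration, and the totals are assumed to be already summed across label sets.

```python
def counter_delta(previous, current):
    """Increase of a counter between two scrapes, treating a drop as a reset."""
    if current < previous:
        return current  # counter restarted from zero, so count only the new value
    return current - previous


def ratio(numerator, denominator):
    """Safe division that returns 0.0 when there was no traffic."""
    return numerator / denominator if denominator else 0.0


# Hypothetical totals, summed across all label sets, from two scrapes 60 seconds apart.
previous = {
    "llm_request_router_requests_total": 1000.0,
    "llm_request_router_proxy_retries_total": 40.0,
    "llm_request_router_proxy_retry_exhausted_total": 2.0,
}
current = {
    "llm_request_router_requests_total": 1500.0,
    "llm_request_router_proxy_retries_total": 65.0,
    "llm_request_router_proxy_retry_exhausted_total": 3.0,
}
interval_seconds = 60.0

deltas = {name: counter_delta(previous[name], current[name]) for name in current}

request_rate = deltas["llm_request_router_requests_total"] / interval_seconds
retry_ratio = ratio(
    deltas["llm_request_router_proxy_retries_total"],
    deltas["llm_request_router_requests_total"],
)
exhausted_ratio = ratio(
    deltas["llm_request_router_proxy_retry_exhausted_total"],
    deltas["llm_request_router_requests_total"],
)

print(f"requests per second: {request_rate:.2f}")
print(f"retries per request: {retry_ratio:.3f}")
print(f"retry-exhausted share: {exhausted_ratio:.4f}")
```

In a Prometheus deployment the same quantities would normally be computed server-side with rate queries; the arithmetic here only illustrates how the counters relate to one another.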