LLM Function Invocation Metrics Report

This report covers the metrics available on the LLM function invocation path: the LLM API Gateway, the LLM Request Router, and the Stargate client sidecar in LLM function pods.

Scrape points

| Component | Endpoint | Service name | Metric prefix |
| --- | --- | --- | --- |
| LLM API Gateway | `llm-api-gateway:9464/metrics` | `llm-api-gateway` | `llm_api_gateway_` |
| Rate limit sync worker | `:9464/metrics` when deployed with `METRICS_PORT=9464` | `llm-api-gateway-rate-limit-sync-worker` | `llm_api_gateway_` |
| LLM Request Router | `llm-request-router:9090/metrics` | `llm-request-router` | `llm_request_router_` |
| Stargate client sidecar | `:9089/metrics` by default | `stargate-client` | `stargate_client_` |

The request-router chart passes --metrics-prefix=llm_request_router_. Upstream Stargate still defaults to stargate_ when run outside the NVCF chart.
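
Because the prefix depends on how a component is deployed, it can be useful to confirm that a scraped payload actually carries the expected prefix. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format. The helper names (`sample_names`, `unexpected_names`) and the `ALLOWED_UNPREFIXED` set are hypothetical and not part of any component; runtime or process metrics emitted by a client library may need to be added to that set.

```python
import re

# Expected prefixes per service name, copied from the scrape point table.
EXPECTED_PREFIXES = {
    "llm-api-gateway": "llm_api_gateway_",
    "llm-api-gateway-rate-limit-sync-worker": "llm_api_gateway_",
    "llm-request-router": "llm_request_router_",
    "stargate-client": "stargate_client_",
}

# Assumption: target_info is the only unprefixed series to tolerate here.
ALLOWED_UNPREFIXED = {"target_info"}

# A sample line starts with the metric name, followed by "{" or whitespace.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def sample_names(exposition_text):
    """Collect the metric names that appear on sample lines."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def unexpected_names(service_name, exposition_text):
    """Return sorted sample names that lack the prefix expected for the service."""
    prefix = EXPECTED_PREFIXES[service_name]
    return sorted(
        name
        for name in sample_names(exposition_text)
        if not name.startswith(prefix) and name not in ALLOWED_UNPREFIXED
    )
```

A router started without `--metrics-prefix` would show up here as a list of `stargate_`-prefixed names, which is a quick signal that it is running outside the NVCF chart defaults.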

LLM API Gateway

| Metric | Labels |
| --- | --- |
| `llm_api_gateway_http_requests_total` | method, route, status |
| `llm_api_gateway_http_request_duration_seconds` | method, route, status |
| `llm_api_gateway_http_active_requests` | method, route |
| `llm_api_gateway_upstream_requests_total` | upstream, result, status |
| `llm_api_gateway_upstream_request_duration_seconds` | upstream, result, status |
| `llm_api_gateway_llm_tokens_total` | endpoint, token_type, stream |
| `llm_api_gateway_provider_time_seconds` | endpoint, phase, stream |
| `llm_api_gateway_stream_first_token_seconds` | endpoint |
| `llm_api_gateway_stream_duration_seconds` | endpoint, status |
| `llm_api_gateway_pubsub_publish_failures_total` | None |
| `llm_api_gateway_pubsub_consume_failures_total` | None |
| `llm_api_gateway_pubsub_consume_duration_seconds` | None |
| `llm_api_gateway_rate_limit_event_replication_lag_seconds` | None |
| `llm_api_gateway_rate_limit_events_received_total` | None |
| `llm_api_gateway_rate_limit_events_dropped_total` | reason |
| `llm_api_gateway_rate_limit_events_applied_total` | None |
| `llm_api_gateway_rate_limit_events_failed_apply_total` | None |
| `llm_api_gateway_rate_limit_events_dry_run_would_apply_total` | None |
| `llm_api_gateway_rate_limit_synchronizer_publish_duration_seconds` | None |
| `llm_api_gateway_rate_limit_synchronizer_queue_wait_seconds` | None |
| `llm_api_gateway_rate_limit_synchronizer_queue_length` | None |
| `llm_api_gateway_rate_limit_synchronizer_events_dropped_total` | reason |

The sync worker reuses the same telemetry package and emits the rate limit synchronizer and Pub/Sub metrics under the worker service name.

LLM Request Router

| Metric | Labels |
| --- | --- |
| `llm_request_router_requests_total` | routing_key, model, inference_server_id, status |
| `llm_request_router_proxy_attempts_total` | routing_key, model, inference_server_id, result |
| `llm_request_router_proxy_retries_total` | routing_key, model, reason |
| `llm_request_router_proxy_retry_exhausted_total` | routing_key, model, reason |
| `llm_request_router_quic_connection_evictions_total` | inference_server_id, reason |
| `llm_request_router_quic_hot_path_reconnect_total` | inference_server_id, result |
| `llm_request_router_proxy_replay_buffer_bytes` | model |
| `llm_request_router_proxy_duration_seconds` | routing_key, model, inference_server_id |
| `llm_request_router_routing_duration_seconds` | routing_key, model |
| `llm_request_router_active_inference_servers` | routing_key, model |

Stargate Client Sidecar

| Metric | Labels |
| --- | --- |
| `target_info` | service_version, service_name, commit |
| `stargate_client_requests_inflight` | model |
| `stargate_client_requests_state` | model, state |
| `stargate_client_requests_state_input_tokens` | model, state |
| `stargate_client_requests_total` | model, routing_key, status |
| `stargate_client_request_time_to_response_headers_seconds` | model, routing_key |
| `stargate_client_request_time_to_first_output_seconds` | model, routing_key |
| `stargate_client_request_time_to_first_token_seconds` | model, routing_key |
| `stargate_client_request_duration_seconds` | model, routing_key, status |
| `stargate_client_request_input_tokens_total` | model, routing_key, status |
| `stargate_client_request_output_tokens_total` | model, routing_key, status |
| `stargate_client_request_input_tokens` | model, routing_key, status |
| `stargate_client_request_output_tokens` | model, routing_key, status |
| `stargate_client_registration_stream_connected` | router |
| `stargate_client_reverse_tunnel_connected` | router |
| `stargate_client_model_input_tps` | model |
| `stargate_client_model_output_tps` | model |
| `stargate_client_model_max_input_tps` | model |
| `stargate_client_model_max_output_tps` | model |
| `stargate_client_model_queue_size` | model |
| `stargate_client_model_queued_input_tokens` | model |
| `stargate_client_model_kv_cache_capacity_tokens` | model |
| `stargate_client_model_kv_cache_used_tokens` | model |
| `stargate_client_model_kv_cache_free_tokens` | model |
| `stargate_client_model_advertised_status` | router, model, status |
| `stargate_client_retryable_responses_total` | inference_server_id, reason, status |
| `stargate_client_nonretryable_failures_total` | inference_server_id, reason |
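
As one illustration of combining the per-model values above, the sketch below derives a KV cache utilization fraction from `stargate_client_model_kv_cache_used_tokens` and `stargate_client_model_kv_cache_capacity_tokens`. It assumes both are gauges reported in tokens and keyed by the `model` label, which the metric names suggest but this report does not state. The function name and the input dictionaries are hypothetical.

```python
def kv_cache_utilization_by_model(used_tokens, capacity_tokens):
    """Return {model: used / capacity} clamped to [0.0, 1.0].

    Both arguments map a model label value to the latest gauge reading.
    Models with no usage reading or a non-positive capacity are skipped,
    since a ratio is meaningless for them.
    """
    utilization = {}
    for model, capacity in capacity_tokens.items():
        if capacity <= 0 or model not in used_tokens:
            continue
        ratio = used_tokens[model] / capacity
        # Clamp in case the two gauges were sampled at slightly different times.
        utilization[model] = min(max(ratio, 0.0), 1.0)
    return utilization
```

The clamp is a defensive choice for readings taken at different instants, not a statement about how the sidecar reports these values.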

Keep request IDs, session IDs, function IDs, organization IDs, project IDs, authorization values, raw prompts, and raw URLs out of metric labels.
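
A rule like this can be checked mechanically when new metrics are added. The following is a minimal sketch of a label-name check, not an existing tool. The exact spellings in `FORBIDDEN_LABEL_NAMES` are assumptions chosen to match the categories listed above, and the check only looks at label names, so it cannot detect a sensitive value stored under an innocuous name.

```python
# Hypothetical label-name spellings for the categories the guideline lists.
FORBIDDEN_LABEL_NAMES = {
    "request_id",
    "session_id",
    "function_id",
    "organization_id",
    "org_id",
    "project_id",
    "authorization",
    "prompt",
    "url",
}


def forbidden_labels(label_names):
    """Return the label names, in input order, that match the forbidden set."""
    hits = []
    for name in label_names:
        # Normalize case and dashes so "Request-ID" matches "request_id".
        normalized = name.strip().lower().replace("-", "_")
        if normalized in FORBIDDEN_LABEL_NAMES:
            hits.append(name)
    return hits
```

For example, `forbidden_labels(["model", "routing_key", "request_id"])` returns `["request_id"]`, while the label sets in the tables above return an empty list.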