LLM API Gateway | NVIDIA Cloud Functions

The LLM API Gateway serves Prometheus metrics from llm-api-gateway:9464/metrics when llmApiGateway.metrics.enabled is true.

The self-managed stack maps global.observability.metrics.enabled to this chart value. The gateway metrics use the llm_api_gateway_ service prefix and must not emit legacy service-prefixed metric names.

Label Boundaries

Use bounded labels only. Do not add request IDs, session IDs, function IDs, organization IDs, project IDs, raw URLs, raw prompts, authorization values, or other unbounded request fields as metric labels.

Metrics

Metric name	Type	Source endpoint	Labels	Notes
`llm_api_gateway_http_requests_total`	Counter	`llm-api-gateway:9464/metrics`	`method`, `route`, `status`	Total inbound HTTP requests. `route` is the templated route, not the raw URL path.
`llm_api_gateway_http_request_duration_seconds`	Histogram	`llm-api-gateway:9464/metrics`	`method`, `route`, `status`	Inbound HTTP request duration in seconds.
`llm_api_gateway_http_active_requests`	Gauge	`llm-api-gateway:9464/metrics`	`method`, `route`	Current in-flight inbound HTTP requests.
`llm_api_gateway_upstream_requests_total`	Counter	`llm-api-gateway:9464/metrics`	`upstream`, `result`, `status`	Total outbound upstream requests. `upstream` is a bounded service name such as `llm-request-router`.
`llm_api_gateway_upstream_request_duration_seconds`	Histogram	`llm-api-gateway:9464/metrics`	`upstream`, `result`, `status`	Outbound upstream request duration in seconds.
`llm_api_gateway_llm_tokens_total`	Counter	`llm-api-gateway:9464/metrics`	`endpoint`, `token_type`, `stream`	LLM token counts reported by upstream providers. `token_type` is a bounded enum such as `prompt`, `completion`, or `total`.
`llm_api_gateway_provider_time_seconds`	Histogram	`llm-api-gateway:9464/metrics`	`endpoint`, `phase`, `stream`	Provider-reported timing phases in seconds.
`llm_api_gateway_stream_first_token_seconds`	Histogram	`llm-api-gateway:9464/metrics`	`endpoint`	Time from stream request start to first token in seconds.
`llm_api_gateway_stream_duration_seconds`	Histogram	`llm-api-gateway:9464/metrics`	`endpoint`, `status`	Total stream duration in seconds.
`llm_api_gateway_pubsub_publish_failures_total`	Counter	`llm-api-gateway:9464/metrics`	None	Number of messages that failed to publish.
`llm_api_gateway_pubsub_consume_failures_total`	Counter	`llm-api-gateway:9464/metrics`	None	Number of messages that failed to consume.
`llm_api_gateway_pubsub_consume_duration_seconds`	Histogram	`llm-api-gateway:9464/metrics`	None	Time to consume a message in seconds.
`llm_api_gateway_rate_limit_event_replication_lag_seconds`	Histogram	`llm-api-gateway:9464/metrics`	None	Lag between rate limit event creation and processing in seconds.
`llm_api_gateway_rate_limit_events_received_total`	Counter	`llm-api-gateway:9464/metrics`	None	Number of rate limit events received from the sync transport.
`llm_api_gateway_rate_limit_events_dropped_total`	Counter	`llm-api-gateway:9464/metrics`	`reason`	Number of received rate limit events dropped. `reason` is a bounded enum such as `same_cluster`, `old_message`, or `remote_apply_disabled`.
`llm_api_gateway_rate_limit_events_applied_total`	Counter	`llm-api-gateway:9464/metrics`	None	Number of rate limit events applied to the local limiter.
`llm_api_gateway_rate_limit_events_failed_apply_total`	Counter	`llm-api-gateway:9464/metrics`	None	Number of rate limit events that failed to apply locally.
`llm_api_gateway_rate_limit_events_dry_run_would_apply_total`	Counter	`llm-api-gateway:9464/metrics`	None	Number of rate limit events that would apply when remote application is disabled.
`llm_api_gateway_rate_limit_synchronizer_publish_duration_seconds`	Histogram	`llm-api-gateway:9464/metrics`	None	Time to publish a rate limit event in seconds.
`llm_api_gateway_rate_limit_synchronizer_queue_wait_seconds`	Histogram	`llm-api-gateway:9464/metrics`	None	Time spent queueing a rate limit event in seconds.
`llm_api_gateway_rate_limit_synchronizer_queue_length`	Gauge	`llm-api-gateway:9464/metrics`	None	Current rate limit synchronizer queue length.
`llm_api_gateway_rate_limit_synchronizer_events_dropped_total`	Counter	`llm-api-gateway:9464/metrics`	`reason`	Number of rate limit events dropped before publishing. `reason` is a bounded enum such as `old_message`.