> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nvcf/llms.txt.
> For full documentation content, see https://docs.nvidia.com/nvcf/llms-full.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nvcf/_mcp/server.

# LLM API Gateway Metrics

The LLM API Gateway serves Prometheus metrics from
`llm-api-gateway:9464/metrics` when
`llmApiGateway.metrics.enabled` is `true`.

The self-managed stack maps `global.observability.metrics.enabled` to this chart
value. The gateway metrics use the `llm_api_gateway_` service prefix and must not
emit legacy service-prefixed metric names.

## Label Boundaries

Use bounded labels only. Do not add request IDs, session IDs, function IDs,
organization IDs, project IDs, raw URLs, raw prompts, authorization values, or
other unbounded request fields as metric labels.

## Metrics

| Metric name | Type | Source endpoint | Labels | Notes |
| --- | --- | --- | --- | --- |
| `llm_api_gateway_http_requests_total` | Counter | `llm-api-gateway:9464/metrics` | `method`, `route`, `status` | Total inbound HTTP requests. `route` is the templated route, not the raw URL path. |
| `llm_api_gateway_http_request_duration_seconds` | Histogram | `llm-api-gateway:9464/metrics` | `method`, `route`, `status` | Inbound HTTP request duration in seconds. |
| `llm_api_gateway_http_active_requests` | Gauge | `llm-api-gateway:9464/metrics` | `method`, `route` | Current in-flight inbound HTTP requests. |
| `llm_api_gateway_upstream_requests_total` | Counter | `llm-api-gateway:9464/metrics` | `upstream`, `result`, `status` | Total outbound upstream requests. `upstream` is a bounded service name such as `llm-request-router`. |
| `llm_api_gateway_upstream_request_duration_seconds` | Histogram | `llm-api-gateway:9464/metrics` | `upstream`, `result`, `status` | Outbound upstream request duration in seconds. |
| `llm_api_gateway_llm_tokens_total` | Counter | `llm-api-gateway:9464/metrics` | `endpoint`, `token_type`, `stream` | LLM token counts reported by upstream providers. `token_type` is a bounded enum such as `prompt`, `completion`, or `total`. |
| `llm_api_gateway_provider_time_seconds` | Histogram | `llm-api-gateway:9464/metrics` | `endpoint`, `phase`, `stream` | Provider-reported timing phases in seconds. |
| `llm_api_gateway_stream_first_token_seconds` | Histogram | `llm-api-gateway:9464/metrics` | `endpoint` | Time from stream request start to first token in seconds. |
| `llm_api_gateway_stream_duration_seconds` | Histogram | `llm-api-gateway:9464/metrics` | `endpoint`, `status` | Total stream duration in seconds. |
| `llm_api_gateway_pubsub_publish_failures_total` | Counter | `llm-api-gateway:9464/metrics` | None | Number of messages that failed to publish. |
| `llm_api_gateway_pubsub_consume_failures_total` | Counter | `llm-api-gateway:9464/metrics` | None | Number of messages that failed to consume. |
| `llm_api_gateway_pubsub_consume_duration_seconds` | Histogram | `llm-api-gateway:9464/metrics` | None | Time to consume a message in seconds. |
| `llm_api_gateway_rate_limit_event_replication_lag_seconds` | Histogram | `llm-api-gateway:9464/metrics` | None | Lag between rate limit event creation and processing in seconds. |
| `llm_api_gateway_rate_limit_events_received_total` | Counter | `llm-api-gateway:9464/metrics` | None | Number of rate limit events received from the sync transport. |
| `llm_api_gateway_rate_limit_events_dropped_total` | Counter | `llm-api-gateway:9464/metrics` | `reason` | Number of received rate limit events dropped. `reason` is a bounded enum such as `same_cluster`, `old_message`, or `remote_apply_disabled`. |
| `llm_api_gateway_rate_limit_events_applied_total` | Counter | `llm-api-gateway:9464/metrics` | None | Number of rate limit events applied to the local limiter. |
| `llm_api_gateway_rate_limit_events_failed_apply_total` | Counter | `llm-api-gateway:9464/metrics` | None | Number of rate limit events that failed to apply locally. |
| `llm_api_gateway_rate_limit_events_dry_run_would_apply_total` | Counter | `llm-api-gateway:9464/metrics` | None | Number of rate limit events that would apply when remote application is disabled. |
| `llm_api_gateway_rate_limit_synchronizer_publish_duration_seconds` | Histogram | `llm-api-gateway:9464/metrics` | None | Time to publish a rate limit event in seconds. |
| `llm_api_gateway_rate_limit_synchronizer_queue_wait_seconds` | Histogram | `llm-api-gateway:9464/metrics` | None | Time spent queueing a rate limit event in seconds. |
| `llm_api_gateway_rate_limit_synchronizer_queue_length` | Gauge | `llm-api-gateway:9464/metrics` | None | Current rate limit synchronizer queue length. |
| `llm_api_gateway_rate_limit_synchronizer_events_dropped_total` | Counter | `llm-api-gateway:9464/metrics` | `reason` | Number of rate limit events dropped before publishing. `reason` is a bounded enum such as `old_message`. |