Logging and Observability#
NIM LLM provides structured logging, Prometheus-compatible metrics, and distributed tracing support to integrate with standard observability stacks.
Structured Logging#
NIM outputs logs to stderr. The following environment variables control log behavior:
| Variable | Description | Default |
|---|---|---|
| `NIM_LOG_LEVEL` | Log verbosity. Accepts standard Python logging levels: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. | |
| `NIM_JSONL_LOGGING` | Emit logs as JSON Lines for machine parsing. Set to `true` to enable. | `false` |
Set the Log Level#
Set NIM_LOG_LEVEL when starting the container as shown in the following command:
```shell
docker run --gpus all -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_LOG_LEVEL=INFO \
  <image>
```
Default log output uses a human-readable format:
```text
INFO 2026-03-03 12:00:10.079 start_server.py:42] Server starting on port 8000
```
`NIM_LOG_LEVEL` controls both the NIM and the vLLM backend. If you need to override
the backend log level independently, set `VLLM_LOGGING_LEVEL` in addition to
`NIM_LOG_LEVEL`.
Enable JSON Lines Output#
When `NIM_JSONL_LOGGING=true`, all log output from both the NIM and the vLLM
backend is formatted as one JSON object per line using a unified format.
This is recommended for production deployments where logs are ingested by a
collector such as Fluentd, Logstash, or CloudWatch.
The following command shows how to enable JSON Lines output:
```shell
docker run --gpus all -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_JSONL_LOGGING=true \
  <image>
```
Example JSON Lines output:
```json
{"level":"INFO","time":"2026-03-03 12:00:10","file_name":"start_server.py","line_number":42,"message":"Server starting on port 8000"}
```
Both NIM and vLLM logs share the same JSON schema, so a single log parser configuration handles all container output.
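Because every record is a self-contained JSON object, the output is trivial to consume programmatically. The following sketch parses one record; the field names match the example line above, though a production deployment would normally leave this to the log collector:

```python
import json

# A sample line in the unified JSON Lines format shown above.
sample = ('{"level":"INFO","time":"2026-03-03 12:00:10",'
          '"file_name":"start_server.py","line_number":42,'
          '"message":"Server starting on port 8000"}')

def parse_log_line(line: str) -> dict:
    """Parse one JSON Lines log record emitted by the container."""
    return json.loads(line)

record = parse_log_line(sample)
print(f'{record["level"]} {record["file_name"]}:{record["line_number"]} '
      f'{record["message"]}')
```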
Metrics#
NIM exposes Prometheus-compatible metrics at the /v1/metrics endpoint.
Run the following command to retrieve the metrics:
```shell
curl -s http://localhost:8000/v1/metrics
```
Design Principle#
NIM passes through the inference backend’s native Prometheus metrics without
modification. The metrics available at /v1/metrics are the same metrics
produced by vLLM, covering request latency, throughput, queue depth, token
counts, and GPU utilization. NIM does not wrap, rename, or abstract these
metrics, so existing vLLM dashboards and alerting rules work without changes.
For the full list of available metrics and their descriptions, refer to the vLLM production metrics documentation.
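For quick inspection without a full Prometheus stack, the text exposition format returned by `/v1/metrics` can be parsed directly. The following is a minimal sketch that handles plain samples; the metric name in the example is one of vLLM's gauges, used here for illustration only:

```python
# Minimal parser for the Prometheus text exposition format.
# Skips HELP/TYPE comment lines and maps each sample line,
# "name{labels} value", to a float.
def parse_metrics(text: str) -> dict[str, float]:
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no samples
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

example = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 3.0
"""
print(parse_metrics(example))
```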
Prometheus Scrape Configuration#
To scrape NIM metrics with Prometheus, add the following job to your prometheus.yml:
```yaml
scrape_configs:
  - job_name: "nim-llm"
    scrape_interval: 15s
    metrics_path: "/v1/metrics"
    static_configs:
      - targets: ["<nim-host>:8000"]
```
Distributed Tracing#
NIM supports request correlation and distributed tracing through standard HTTP headers. The proxy layer forwards tracing headers from clients to the inference backend on the following inference endpoints:
- `/v1/chat/completions`
- `/v1/completions`
- `/v1/embeddings`
- `/v1/responses`
Tracing headers are not forwarded on management or health endpoints.
X-Request-Id#
Include an X-Request-Id header in your request to tag it with a correlation
identifier. NIM forwards this value to the backend, where the backend adopts
it as the internal request ID. If the header is not present, the backend
generates a random ID instead.
```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Request-Id: req-abc-123" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'
```
When correlating logs across services, the X-Request-Id value appears in
backend log entries associated with that request because it is used as the
request identifier throughout the inference pipeline.
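The same request can be built from Python. This sketch generates a fresh correlation ID per call with the standard library only; the endpoint URL and model name follow the curl example above:

```python
import json
import urllib.request
import uuid

# Build a chat completion request tagged with a unique correlation ID.
# URL and model follow the curl example above.
def build_request(prompt: str) -> urllib.request.Request:
    request_id = f"req-{uuid.uuid4()}"
    body = json.dumps({
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Request-Id": request_id,
        },
        method="POST",
    )

req = build_request("Hello")
# Record the ID before sending so it can be matched against backend logs.
print(req.get_header("X-request-id"))
```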
W3C Traceparent#
NIM forwards the W3C Trace Context `traceparent` header to the inference
backend. If your application uses OpenTelemetry or another W3C-compatible
tracing system, include the `traceparent` header to propagate trace context
through NIM:
```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'
```
When OpenTelemetry tracing is enabled on the backend, this allows inference spans to appear in your existing distributed traces alongside upstream and downstream services.
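If you are not using an OpenTelemetry SDK, a valid `traceparent` value can be constructed directly from the W3C layout (`version-traceid-parentid-flags`) visible in the curl example above. A minimal sketch:

```python
import secrets

# Build a W3C Trace Context traceparent header value:
# version "00", a 16-byte trace ID, an 8-byte parent span ID,
# and trace flags ("01" = sampled), all hex-encoded and dash-separated.
def make_traceparent(sampled: bool = True) -> str:
    trace_id = secrets.token_hex(16)   # 32 hex chars; must not be all zeros
    parent_id = secrets.token_hex(8)   # 16 hex chars; must not be all zeros
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"

print(make_traceparent())
```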
Note
The proxy layer always forwards the traceparent header to the backend.
However, the backend only creates OpenTelemetry spans from it when tracing is
enabled through the OTEL_EXPORTER_OTLP_TRACES_ENDPOINT environment variable.
Without this configuration, the backend logs a warning and discards the trace
context. For details on configuring OpenTelemetry in the backend, refer to the
vLLM OpenTelemetry documentation.
Supported Tracing Headers#
The following table describes the tracing headers that NIM forwards on inference endpoints:
| Header | Standard | Description |
|---|---|---|
| `X-Request-Id` | Custom | Request correlation identifier. Forwarded to the backend and adopted as the internal request ID for logging and diagnostics. |
| `traceparent` | W3C Trace Context | Propagates distributed trace context to the inference backend. Requires OpenTelemetry to be enabled on the backend to generate spans. |
Note
Tracing headers are forwarded only on inference endpoints. Management
endpoints (/v1/metrics, /v1/models, and so on) and health endpoints do not
propagate tracing headers.