Logging and Observability#

NIM LLM provides structured logging, Prometheus-compatible metrics, and distributed tracing support to integrate with standard observability stacks.

Structured Logging#

NIM outputs logs to stderr. The following environment variables control log behavior:

| Variable | Description | Default |
|---|---|---|
| NIM_LOG_LEVEL | Log verbosity. Accepts the standard Python logging levels: DEBUG, INFO, WARNING, ERROR, and CRITICAL. | WARNING |
| NIM_JSONL_LOGGING | Emit logs as JSON Lines for machine parsing. Set to true to enable. | false |

Set the Log Level#

Set NIM_LOG_LEVEL when starting the container as shown in the following command:

docker run --gpus all -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_LOG_LEVEL=INFO \
  <image>

Default log output uses a human-readable format:

INFO 2026-03-03 12:00:10.079 start_server.py:42] Server starting on port 8000

NIM_LOG_LEVEL sets the log level for both NIM and the vLLM backend. To override the backend log level independently, set VLLM_LOGGING_LEVEL in addition to NIM_LOG_LEVEL.
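Because NIM accepts the standard Python logging levels, filtering follows the same hierarchy as Python's logging module. The following local sketch (not NIM code) shows what the default WARNING level suppresses:

```python
import logging

# Collect emitted record levels in memory so the filtering is easy to inspect.
records = []

class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record.levelname)

logger = logging.getLogger("nim-demo")
logger.addHandler(ListHandler())
logger.propagate = False

# Default NIM_LOG_LEVEL is WARNING: DEBUG and INFO records are dropped.
logger.setLevel(logging.WARNING)
logger.debug("not emitted")
logger.info("not emitted")
logger.warning("emitted")
logger.error("emitted")

print(records)  # ['WARNING', 'ERROR']
```

Setting NIM_LOG_LEVEL=INFO would additionally pass INFO records, and DEBUG passes everything.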

Enable JSON Lines Output#

When NIM_JSONL_LOGGING=true, all log output from both the NIM and the vLLM backend is formatted as one JSON object per line using a unified format. This is recommended for production deployments where logs are ingested by a collector such as Fluentd, Logstash, or CloudWatch.

The following command shows how to enable JSON Lines output:

docker run --gpus all -p 8000:8000 \
  -e NGC_API_KEY \
  -e NIM_JSONL_LOGGING=true \
  <image>

Example JSON Lines output:

{"level":"INFO","time":"2026-03-03 12:00:10","file_name":"start_server.py","line_number":42,"message":"Server starting on port 8000"}

Both NIM and vLLM logs share the same JSON schema, so a single log parser configuration handles all container output.
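Because each line is an independent JSON object, ingestion is a line-by-line parse. A minimal Python sketch using the field names from the example output above:

```python
import json

def parse_jsonl_logs(stream):
    """Parse JSON Lines log output: one JSON object per non-empty line."""
    entries = []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        entries.append(json.loads(line))
    return entries

# Sample line copied from the example output above.
sample = [
    '{"level":"INFO","time":"2026-03-03 12:00:10","file_name":"start_server.py",'
    '"line_number":42,"message":"Server starting on port 8000"}'
]
entries = parse_jsonl_logs(sample)
print(entries[0]["level"], entries[0]["message"])
```

A production collector would apply the same per-line parse to the container's stderr stream.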

Metrics#

NIM exposes Prometheus-compatible metrics at the /v1/metrics endpoint.

Run the following command to retrieve the metrics:

curl -s http://localhost:8000/v1/metrics

Design Principle#

NIM passes through the inference backend’s native Prometheus metrics without modification. The metrics available at /v1/metrics are the same metrics produced by vLLM, covering request latency, throughput, queue depth, token counts, and GPU utilization. NIM does not wrap, rename, or abstract these metrics, so existing vLLM dashboards and alerting rules work without changes.

For the full list of available metrics and their descriptions, refer to the vLLM production metrics documentation.
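The endpoint serves the Prometheus text exposition format. The following sketch parses simple gauge samples from a scrape; the metric names in the sample are illustrative of vLLM-style gauges and may differ by version:

```python
def parse_prom_samples(text):
    """Parse unlabeled samples from Prometheus exposition text.
    Skips # HELP / # TYPE comment lines; does not handle histograms."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Illustrative excerpt of a scrape; exact metric names and values vary.
scrape = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 3.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 1.0
"""
samples = parse_prom_samples(scrape)
print(samples["vllm:num_requests_running"])  # 3.0
```

In practice you would let Prometheus scrape the endpoint directly, as shown in the scrape configuration below, rather than parsing it by hand.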

Prometheus Scrape Configuration#

To scrape NIM metrics with Prometheus, add the following job to your prometheus.yml:

scrape_configs:
  - job_name: "nim-llm"
    scrape_interval: 15s
    metrics_path: "/v1/metrics"
    static_configs:
      - targets: ["<nim-host>:8000"]

Distributed Tracing#

NIM supports request correlation and distributed tracing through standard HTTP headers. The proxy layer forwards tracing headers from clients to the inference backend on the following inference endpoints:

  • /v1/chat/completions
  • /v1/completions
  • /v1/embeddings
  • /v1/responses

Tracing headers are not forwarded on management or health endpoints.
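The forwarding rule above can be sketched as a simple path check. The function and constants below are an illustrative model of the behavior, not NIM code:

```python
# Inference endpoints on which tracing headers are forwarded (from the list above).
INFERENCE_PATHS = {
    "/v1/chat/completions",
    "/v1/completions",
    "/v1/embeddings",
    "/v1/responses",
}
TRACING_HEADERS = {"x-request-id", "traceparent"}

def forwarded_headers(path, headers):
    """Return the tracing headers that would be forwarded for a request path."""
    if path not in INFERENCE_PATHS:
        return {}  # management and health endpoints: nothing is forwarded
    return {k: v for k, v in headers.items() if k.lower() in TRACING_HEADERS}

print(forwarded_headers("/v1/chat/completions", {"X-Request-Id": "req-1"}))
print(forwarded_headers("/v1/metrics", {"X-Request-Id": "req-1"}))  # {}
```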

X-Request-Id#

Include an X-Request-Id header in your request to tag it with a correlation identifier. NIM forwards this value to the backend, which adopts it as the internal request ID. If the header is not present, the backend generates a random ID instead.

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Request-Id: req-abc-123" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'

Because the backend uses the X-Request-Id value as the request identifier throughout the inference pipeline, the value appears in the backend log entries for that request, which makes it straightforward to correlate logs across services.
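If a client does not already carry a correlation ID, it can generate a unique one per request. The helper below is a hypothetical client-side convenience, not a NIM API:

```python
import uuid

def make_request_id(prefix="req"):
    """Generate a unique correlation ID to send as X-Request-Id.
    Illustrative helper; any unique string works."""
    return f"{prefix}-{uuid.uuid4().hex}"

# Headers for an inference request, tagged with a fresh correlation ID.
headers = {
    "Content-Type": "application/json",
    "X-Request-Id": make_request_id(),
}
print(headers["X-Request-Id"])
```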

W3C Traceparent#

NIM forwards the W3C Trace Context traceparent header to the inference backend. If your application uses OpenTelemetry or another W3C-compatible tracing system, include the traceparent header to propagate trace context through NIM:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'

When OpenTelemetry tracing is enabled on the backend, this allows inference spans to appear in your existing distributed traces alongside upstream and downstream services.
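An OpenTelemetry SDK normally generates and propagates traceparent automatically. For manual testing, a well-formed header can be built from random IDs; the helper below is illustrative only:

```python
import re
import secrets

def make_traceparent(sampled=True):
    """Build a W3C traceparent header: version-traceid-parentid-flags.
    (The spec forbids all-zero IDs; random generation makes that negligible.)"""
    trace_id = secrets.token_hex(16)   # 32 lowercase hex chars
    parent_id = secrets.token_hex(8)   # 16 lowercase hex chars
    flags = "01" if sampled else "00"  # 01 marks the trace as sampled
    return f"00-{trace_id}-{parent_id}-{flags}"

tp = make_traceparent()
assert re.fullmatch(r"00-[0-9a-f]{32}-[0-9a-f]{16}-01", tp)
print(tp)
```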

Note

The proxy layer always forwards the traceparent header to the backend. However, the backend only creates OpenTelemetry spans from it when tracing is enabled through the OTEL_EXPORTER_OTLP_TRACES_ENDPOINT environment variable. Without this configuration, the backend logs a warning and discards the trace context. For details on configuring OpenTelemetry in the backend, refer to the vLLM OpenTelemetry documentation.

Supported Tracing Headers#

The following table describes the tracing headers that NIM forwards on inference endpoints:

| Header | Standard | Description |
|---|---|---|
| X-Request-Id | Custom | Request correlation identifier. Forwarded to the backend and adopted as the internal request ID for logging and diagnostics. |
| traceparent | W3C Trace Context | Propagates distributed trace context to the inference backend. Requires OpenTelemetry to be enabled on the backend to generate spans. |

Note

Tracing headers are forwarded only on inference endpoints. Management endpoints (/v1/metrics, /v1/models, and so on) and health endpoints do not propagate tracing headers.