Per-Request Metrics (Experimental)#

Warning

This is an experimental feature that may change or be removed in future releases. The metrics format, field names, and calculation methods are subject to modification.

NIM for LLMs provides detailed per-request performance metrics that can be included in API responses. These metrics offer granular insights into token processing, timing, and throughput for individual requests, enabling precise performance analysis and optimization.

Enabling Per-Request Metrics#

Per-request metrics are disabled by default. To enable them, set the following environment variable:

export NIM_PER_REQ_METRICS_ENABLE=1

Or when running with Docker:

docker run -it --rm --gpus all \
  -e NIM_PER_REQ_METRICS_ENABLE=1 \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.0

When enabled, API responses will include a stats field with detailed timing and token metrics.
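For example, the following minimal Python sketch sends a non-streaming completion request and prints the returned stats object. It assumes a local deployment on port 8000 serving meta/llama-3.1-8b-instruct with NIM_PER_REQ_METRICS_ENABLE=1 set, as in the Docker command above; adjust the URL and model name for your setup.

# Sketch: query a NIM endpoint with per-request metrics enabled and inspect
# the returned "stats" field. Assumes a local server on port 8000 serving
# meta/llama-3.1-8b-instruct (adjust for your deployment).
import json
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta/llama-3.1-8b-instruct",
        "prompt": "Hello, world",
        "max_tokens": 16,
    },
    timeout=60,
)
body = response.json()

# With NIM_PER_REQ_METRICS_ENABLE=1, the response carries a "stats" object.
print(json.dumps(body.get("stats"), indent=2))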

Supported Endpoints#

Per-request metrics are available for:

  • /v1/completions (streaming and non-streaming)

  • /v1/chat/completions (streaming and non-streaming)

Response Schema#

When enabled, responses include a stats object with the following structure:

Top-Level Stats Object#

| Field | Type | Description |
|-------|------|-------------|
| type | string | Always "NIM LLM Model Stats" |
| version | string | Stats format version (currently "0.1.0") |
| llm_input_token_length | integer | Number of tokens in the user’s prompt after processing by the chat template |
| llm_output_token_length | integer | Total number of tokens generated by the LLM in its response |
| generation_time_in_ms | float | Total time for LLM processing (milliseconds) |
| time_in_queue_in_ms | float | Time spent waiting in queue (milliseconds; vLLM only) |
| response_tokens | object | Detailed token timing statistics |

Response Tokens Object#

| Field | Type | Description |
|-------|------|-------------|
| response_token_length | integer | Number of output tokens (same as llm_output_token_length) |
| time_to_first_token_in_ms | float | Time from processing start to first token (milliseconds) |
| token_to_token_time_in_ms | float | Average time between subsequent tokens (milliseconds) |
| tokens_per_second | float | Generation throughput (tokens/second) |
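For client-side convenience, the schema above can be mirrored as simple data classes. The sketch below is illustrative only; the field names come from the tables in this section, not from an official NIM client library.

# Sketch: the stats schema expressed as Python dataclasses for convenient
# access in client code. Illustrative only; field names follow the tables
# in this section, not an official NIM client type.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseTokens:
    response_token_length: int
    time_to_first_token_in_ms: float
    token_to_token_time_in_ms: float
    tokens_per_second: float


@dataclass
class RequestStats:
    type: str                            # always "NIM LLM Model Stats"
    version: str                         # stats format version, e.g. "0.1.0"
    llm_input_token_length: int
    llm_output_token_length: int
    generation_time_in_ms: float
    time_in_queue_in_ms: Optional[float]  # null on TensorRT-LLM
    response_tokens: ResponseTokens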

Example Response#

{
  "id": "cmpl-526b85edd5c6448e9814631590f188dc",
  "object": "text_completion",
  "created": 1756340115,
  "model": "bigcode/starcoder2-7b",
  "choices": [
    {
      "index": 0,
      "text": " by @\"\nvar WORLD string = \"?lang=en_us",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": []
    }
  ],
  "usage": {
    "prompt_tokens": 3,
    "total_tokens": 18,
    "completion_tokens": 15,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null,
  "stats": {
    "type": "NIM LLM Model Stats",
    "version": "0.1.0",
    "llm_input_token_length": 3,
    "llm_output_token_length": 15,
    "generation_time_in_ms": 237.8466129,
    "time_in_queue_in_ms": 0.0,
    "response_tokens": {
      "response_token_length": 15,
      "time_to_first_token_in_ms": 237.834453,
      "token_to_token_time_in_ms": 0.000868,
      "tokens_per_second": 63.065854
    }
  }
}
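The values in this example are consistent with a few simple relationships between the reported fields, recomputed in the sketch below. The exact internal calculation methods are not documented here and may change, so treat this only as a back-of-the-envelope check.

# Sanity-check sketch: derive throughput and inter-token time from the
# reported fields using the example values above. NIM's internal formulas
# may differ slightly.
stats = {
    "llm_output_token_length": 15,
    "generation_time_in_ms": 237.8466129,
    "response_tokens": {
        "time_to_first_token_in_ms": 237.834453,
    },
}

n_tokens = stats["llm_output_token_length"]
gen_time_ms = stats["generation_time_in_ms"]
ttft_ms = stats["response_tokens"]["time_to_first_token_in_ms"]

tokens_per_second = n_tokens / (gen_time_ms / 1000.0)          # ~63.07
token_to_token_ms = (gen_time_ms - ttft_ms) / (n_tokens - 1)   # ~0.00087

print(f"tokens/s ~ {tokens_per_second:.2f}, inter-token ~ {token_to_token_ms:.6f} ms")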

Backend-Specific Behavior#

vLLM Backend#

  • High precision: Uses dedicated timing points from vLLM’s internal metrics

  • Queue separation: Provides separate time_in_queue_in_ms metric

  • Generation time scope: Measures pure LLM processing time (excludes queue time)

  • TTFT reference: Calculated from when LLM processing starts (after queue)

TensorRT-LLM Backend#

  • Estimation-based: Uses available timing points for calculation

  • End-to-end timing: Generation time includes total request time (queue + processing)

  • No queue separation: time_in_queue_in_ms is always null

  • TTFT reference: Calculated from request arrival time

Note

Due to different timing reference points, the same request may show different metric values between vLLM and TensorRT-LLM backends. vLLM provides pure LLM performance metrics, while TensorRT-LLM provides end-to-end latency metrics.
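If you need roughly comparable end-to-end latency figures across backends, the following sketch adds queue time back for vLLM and uses the generation time as-is for TensorRT-LLM, based on the field semantics described above.

# Sketch: derive a roughly comparable end-to-end latency from a stats object.
# vLLM excludes queue time from generation_time_in_ms, so it is added back;
# TensorRT-LLM already includes it and reports time_in_queue_in_ms as null.
def end_to_end_latency_ms(stats):
    gen_ms = stats["generation_time_in_ms"]
    queue_ms = stats.get("time_in_queue_in_ms")
    return gen_ms if queue_ms is None else gen_ms + queue_ms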

Streaming vs Non-Streaming#

  • Calculation consistency: Metrics use the same formulas for both modes

  • Token counting: Streaming responses accumulate tokens progressively

  • Final values: Only the final (complete) metrics are meaningful for analysis (a collection sketch follows this list)
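The sketch below collects the final stats from a streaming completion. It assumes that streamed chunks carry the stats object as it accumulates and that the last chunk carrying it holds the complete values; adjust the URL and model name for your deployment.

# Sketch: keep only the last "stats" object seen in a streaming response,
# since earlier values are partial. Assumes a local server on port 8000
# serving meta/llama-3.1-8b-instruct.
import json
import requests

with requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta/llama-3.1-8b-instruct",
        "prompt": "Hello, world",
        "max_tokens": 16,
        "stream": True,
    },
    stream=True,
    timeout=60,
) as response:
    final_stats = None
    for line in response.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        # Keep the most recent stats object; earlier ones are partial.
        final_stats = chunk.get("stats", final_stats)

print(json.dumps(final_stats, indent=2))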

Known Issues#

TensorRT-LLM LLM API Non-Streaming Timing#

Warning

Known Issue: TensorRT-LLM LLM API non-streaming requests may have less accurate timing metrics compared to vLLM due to limited timing granularity. This affects the precision of generation_time_in_ms, time_to_first_token_in_ms, and derived metrics like tokens_per_second.

Workaround: For more accurate timing analysis with TensorRT-LLM, consider using streaming mode or the vLLM backend when possible.

Use Cases#

  • Performance optimization: Identify bottlenecks in token generation

  • SLA monitoring: Track time to first token for user experience (see the sketch after this list)

  • Capacity planning: Analyze throughput patterns and queue behavior

  • A/B testing: Compare performance between different configurations

  • Debugging: Investigate latency issues with granular timing data
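As one concrete example of SLA monitoring, the sketch below flags requests whose time to first token exceeds a budget. Both collected_stats and the 500 ms budget are hypothetical placeholders.

# Sketch: flag per-request stats whose time-to-first-token exceeds an SLA
# budget. "collected_stats" is a hypothetical list of stats objects gathered
# from your API responses; the 500 ms budget is an arbitrary example value.
def ttft_violations(collected_stats, budget_ms=500.0):
    """Return the time_to_first_token_in_ms values that exceed budget_ms."""
    return [
        s["response_tokens"]["time_to_first_token_in_ms"]
        for s in collected_stats
        if s["response_tokens"]["time_to_first_token_in_ms"] > budget_ms
    ]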