Per-Request Metrics (Experimental)#
Warning
This is an experimental feature that may change or be removed in future releases. The metrics format, field names, and calculation methods are subject to modification.
NIM for LLMs provides detailed per-request performance metrics that can be included in API responses. These metrics offer granular insights into token processing, timing, and throughput for individual requests, enabling precise performance analysis and optimization.
Enabling Per-Request Metrics#
Per-request metrics are disabled by default. To enable them, set the following environment variable:
export NIM_PER_REQ_METRICS_ENABLE=1
Or when running with Docker:
docker run -it --rm --gpus all \
-e NIM_PER_REQ_METRICS_ENABLE=1 \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.0
When enabled, API responses include a stats field with detailed timing and token metrics.
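The following is a minimal sketch of retrieving the stats field from a non-streaming request. The base URL, port, and model name are placeholders for your own deployment; the payload is a standard OpenAI-compatible completions request.

```python
# Minimal sketch: call an OpenAI-compatible NIM endpoint and print the stats field.
# BASE_URL and MODEL are placeholders; substitute the values for your deployment.
import json
import requests

BASE_URL = "http://localhost:8000"      # assumed local NIM deployment
MODEL = "meta/llama-3.1-8b-instruct"    # assumed model name

payload = {
    "model": MODEL,
    "prompt": "Hello, world",
    "max_tokens": 16,
}

resp = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
body = resp.json()

# With NIM_PER_REQ_METRICS_ENABLE=1, the response carries a top-level "stats" object.
print(json.dumps(body.get("stats"), indent=2))
```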
Supported Endpoints#
Per-request metrics are available for:
/v1/completions (streaming and non-streaming)
/v1/chat/completions (streaming and non-streaming)
Response Schema#
When enabled, responses include a stats object with the following structure:
Top-Level Stats Object#
| Field | Type | Description |
|---|---|---|
| type | string | Always "NIM LLM Model Stats" |
| version | string | Stats format version (currently "0.1.0") |
| llm_input_token_length | integer | Number of tokens in the user’s prompt after processing by the chat template |
| llm_output_token_length | integer | Total number of tokens generated by the LLM in its response |
| generation_time_in_ms | float | Total time for LLM processing (milliseconds) |
| time_in_queue_in_ms | float | Time spent waiting in queue (milliseconds, vLLM only) |
| response_tokens | object | Detailed token timing statistics |
Response Tokens Object#
| Field | Type | Description |
|---|---|---|
| response_token_length | integer | Number of output tokens (same as llm_output_token_length) |
| time_to_first_token_in_ms | float | Time from processing start to first token (milliseconds) |
| token_to_token_time_in_ms | float | Average time between subsequent tokens (milliseconds) |
| tokens_per_second | float | Generation throughput (tokens/second) |
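For reference, the two objects above can be summarized as the following type sketch. The class names are illustrative only; field names and types mirror the tables, and time_in_queue_in_ms is marked optional because it is null with the TensorRT-LLM backend.

```python
# Sketch of the stats schema as Python typed dictionaries, mirroring the tables above.
from typing import Optional, TypedDict


class ResponseTokens(TypedDict):
    response_token_length: int        # same as llm_output_token_length
    time_to_first_token_in_ms: float
    token_to_token_time_in_ms: float
    tokens_per_second: float


class Stats(TypedDict):
    type: str                         # always "NIM LLM Model Stats"
    version: str                      # stats format version, e.g. "0.1.0"
    llm_input_token_length: int
    llm_output_token_length: int
    generation_time_in_ms: float
    time_in_queue_in_ms: Optional[float]   # null with the TensorRT-LLM backend
    response_tokens: ResponseTokens
```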
Example Response#
{
"id": "cmpl-526b85edd5c6448e9814631590f188dc",
"object": "text_completion",
"created": 1756340115,
"model": "bigcode/starcoder2-7b",
"choices": [
{
"index": 0,
"text": " by @\"\nvar WORLD string = \"?lang=en_us",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": []
}
],
"usage": {
"prompt_tokens": 3,
"total_tokens": 18,
"completion_tokens": 15,
"prompt_tokens_details": null
},
"kv_transfer_params": null,
"stats": {
"type": "NIM LLM Model Stats",
"version": "0.1.0",
"llm_input_token_length": 3,
"llm_output_token_length": 15,
"generation_time_in_ms": 237.8466129,
"time_in_queue_in_ms": 0.0,
"response_tokens": {
"response_token_length": 15,
"time_to_first_token_in_ms": 237.834453,
"token_to_token_time_in_ms": 0.000868,
"tokens_per_second": 63.065854
}
}
}
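As a rough consistency check (the exact internal formula is not documented here), the reported throughput in this example is approximately the output token count divided by the generation time in seconds:

```python
# Rough consistency check against the example response above; treat this as an
# approximation only, not the official calculation.
stats = {
    "llm_output_token_length": 15,
    "generation_time_in_ms": 237.8466129,
    "response_tokens": {"tokens_per_second": 63.065854},
}

approx_tps = stats["llm_output_token_length"] / (stats["generation_time_in_ms"] / 1000.0)
print(f"approx tokens/s: {approx_tps:.2f}")   # ~63.07, close to the reported 63.065854
```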
Backend-Specific Behavior#
vLLM Backend#
High precision: Uses dedicated timing points from vLLM’s internal metrics
Queue separation: Provides a separate time_in_queue_in_ms metric
Generation time scope: Measures pure LLM processing time (excludes queue time)
TTFT reference: Calculated from when LLM processing starts (after the queue)
TensorRT-LLM Backend#
Estimation-based: Uses available timing points for calculation
End-to-end timing: Generation time includes total request time (queue + processing)
No queue separation: time_in_queue_in_ms is always null
TTFT reference: Calculated from request arrival time
Note
Due to different timing reference points, the same request may show different metric values between vLLM and TensorRT-LLM backends. vLLM provides pure LLM performance metrics, while TensorRT-LLM provides end-to-end latency metrics.
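One way to compare the backends on a like-for-like basis is to fold queue time back into the vLLM generation time before comparing end-to-end latency. The helper below is an illustration, not part of the NIM API; it only encodes the behavior described above.

```python
# Illustrative helper (not part of NIM): derive a comparable end-to-end latency.
# vLLM: generation_time_in_ms excludes queue time, so add time_in_queue_in_ms back.
# TensorRT-LLM: generation_time_in_ms already spans queue + processing and
# time_in_queue_in_ms is null, so use the value as-is.
def end_to_end_latency_ms(stats: dict) -> float:
    queue_ms = stats.get("time_in_queue_in_ms")
    if queue_ms is None:                               # TensorRT-LLM: already end-to-end
        return stats["generation_time_in_ms"]
    return stats["generation_time_in_ms"] + queue_ms   # vLLM: add queue time back
```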
Streaming vs Non-Streaming#
Calculation consistency: Metrics use the same formulas for both modes
Token counting: Streaming responses accumulate tokens progressively
Final values: Only the final (complete) metrics are meaningful for analysis (see the sketch after this list)
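A conservative way to collect streaming metrics is to parse the SSE stream and keep only the last stats object observed, since only the final values are meaningful. This sketch assumes stats ride along on streamed chunks the same way they appear on non-streaming responses; the exact chunk placement is not specified here.

```python
# Sketch: consume a streaming completion and keep only the last "stats" object seen.
# Assumes stats may appear on streamed chunks; later values supersede earlier ones.
import json
import requests

payload = {
    "model": "meta/llama-3.1-8b-instruct",   # assumed model name
    "prompt": "Hello, world",
    "max_tokens": 16,
    "stream": True,
}

final_stats = None
with requests.post("http://localhost:8000/v1/completions",
                   json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("stats") is not None:
            final_stats = chunk["stats"]     # keep the most recent stats object

print(final_stats)
```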
Known Issues#
TensorRT-LLM LLM API Non-Streaming Timing#
Warning
Known Issue: TensorRT-LLM LLM API non-streaming requests may have less accurate timing metrics compared to vLLM due to limited timing granularity. This affects the precision of generation_time_in_ms, time_to_first_token_in_ms, and derived metrics like tokens_per_second.
Workaround: For more accurate timing analysis with TensorRT-LLM, consider using streaming mode or vLLM backend when possible.
Use Cases#
Performance optimization: Identify bottlenecks in token generation
SLA monitoring: Track time to first token for user experience (see the sketch after this list)
Capacity planning: Analyze throughput patterns and queue behavior
A/B testing: Compare performance between different configurations
Debugging: Investigate latency issues with granular timing data
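For example, SLA monitoring for time to first token can be as simple as collecting time_to_first_token_in_ms across requests and comparing a percentile against a target. The 200 ms target and function names below are illustrative only.

```python
# Illustration: track time-to-first-token against an SLA target (threshold is arbitrary).
import statistics

SLA_TTFT_MS = 200.0   # example target, not a recommendation


def ttft_p95(stats_samples: list[dict]) -> float:
    """95th-percentile TTFT over a batch of collected stats objects."""
    ttfts = [s["response_tokens"]["time_to_first_token_in_ms"] for s in stats_samples]
    # quantiles() with n=20 returns 19 cut points; index 18 is the 95th percentile.
    return statistics.quantiles(ttfts, n=20)[18]


def meets_sla(stats_samples: list[dict]) -> bool:
    return ttft_p95(stats_samples) <= SLA_TTFT_MS
```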