Per-Request Metrics (Experimental)#
Warning
This is an experimental feature that may change or be removed in future releases. The metrics format, field names, and calculation methods are subject to modification.
NIM for LLMs provides detailed per-request performance metrics that can be included in API responses. These metrics offer granular insights into token processing, timing, and throughput for individual requests, enabling precise performance analysis and optimization.
Enabling Per-Request Metrics#
Per-request metrics are disabled by default. To enable them, set the following environment variable:
export NIM_PER_REQ_METRICS_ENABLE=1
Or when running with Docker:
docker run -it --rm --gpus all \
-e NIM_PER_REQ_METRICS_ENABLE=1 \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.0
When enabled, API responses include a stats field with detailed timing and token metrics.
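The following is a minimal sketch of retrieving the stats field from a non-streaming request. The base URL, port, and model name are placeholders for your own deployment; the payload is a standard OpenAI-compatible completions request.

```python
# Minimal sketch: call an OpenAI-compatible NIM endpoint and print the stats field.
# BASE_URL and MODEL are placeholders; substitute the values for your deployment.
import json
import requests

BASE_URL = "http://localhost:8000"      # assumed local NIM deployment
MODEL = "meta/llama-3.1-8b-instruct"    # assumed model name

payload = {
    "model": MODEL,
    "prompt": "Hello, world",
    "max_tokens": 16,
}

resp = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
body = resp.json()

# With NIM_PER_REQ_METRICS_ENABLE=1, the response carries a top-level "stats" object.
print(json.dumps(body.get("stats"), indent=2))
```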
Supported Endpoints#
Per-request metrics are available for:
/v1/completions (streaming and non-streaming)
/v1/chat/completions (streaming and non-streaming)
Response Schema#
When enabled, responses include a stats object with the following structure:
Top-Level Stats Object#
| Field | Type | Description |
|---|---|---|
| type | string | Always "NIM LLM Model Stats" |
| version | string | Stats format version (currently "0.1.0") |
| llm_input_token_length | integer | Number of tokens in the user’s prompt after processing by the chat template |
| llm_output_token_length | integer | Total number of tokens generated by the LLM in its response |
| generation_time_in_ms | float | Total time for LLM processing (milliseconds) |
| time_in_queue_in_ms | float | Time spent waiting in queue (milliseconds, vLLM only) |
| response_tokens | object | Detailed token timing statistics |
Response Tokens Object#
| Field | Type | Description |
|---|---|---|
| response_token_length | integer | Number of output tokens (same as llm_output_token_length) |
| time_to_first_token_in_ms | float | Time from processing start to first token (milliseconds) |
| token_to_token_time_in_ms | float | Average time between subsequent tokens (milliseconds) |
| tokens_per_second | float | Generation throughput (tokens/second) |
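For reference, the two objects above can be summarized as the following type sketch. The class names are illustrative only; field names and types mirror the tables, and time_in_queue_in_ms is marked optional because it is null with the TensorRT-LLM backend.

```python
# Sketch of the stats schema as Python typed dictionaries, mirroring the tables above.
from typing import Optional, TypedDict


class ResponseTokens(TypedDict):
    response_token_length: int        # same as llm_output_token_length
    time_to_first_token_in_ms: float
    token_to_token_time_in_ms: float
    tokens_per_second: float


class Stats(TypedDict):
    type: str                         # always "NIM LLM Model Stats"
    version: str                      # stats format version, e.g. "0.1.0"
    llm_input_token_length: int
    llm_output_token_length: int
    generation_time_in_ms: float
    time_in_queue_in_ms: Optional[float]   # null with the TensorRT-LLM backend
    response_tokens: ResponseTokens
```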
Example Response#
{
"id": "cmpl-526b85edd5c6448e9814631590f188dc",
"object": "text_completion",
"created": 1756340115,
"model": "bigcode/starcoder2-7b",
"choices": [
{
"index": 0,
"text": " by @\"\nvar WORLD string = \"?lang=en_us",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": []
}
],
"usage": {
"prompt_tokens": 3,
"total_tokens": 18,
"completion_tokens": 15,
"prompt_tokens_details": null
},
"kv_transfer_params": null,
"stats": {
"type": "NIM LLM Model Stats",
"version": "0.1.0",
"llm_input_token_length": 3,
"llm_output_token_length": 15,
"generation_time_in_ms": 237.8466129,
"time_in_queue_in_ms": 0.0,
"response_tokens": {
"response_token_length": 15,
"time_to_first_token_in_ms": 237.834453,
"token_to_token_time_in_ms": 0.000868,
"tokens_per_second": 63.065854
}
}
}
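As a rough consistency check (the exact internal formula is not documented here), the reported throughput in this example is approximately the output token count divided by the generation time in seconds:

```python
# Rough consistency check against the example response above; treat this as an
# approximation only, not the official calculation.
stats = {
    "llm_output_token_length": 15,
    "generation_time_in_ms": 237.8466129,
    "response_tokens": {"tokens_per_second": 63.065854},
}

approx_tps = stats["llm_output_token_length"] / (stats["generation_time_in_ms"] / 1000.0)
print(f"approx tokens/s: {approx_tps:.2f}")   # ~63.07, close to the reported 63.065854
```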
Backend-Specific Behavior#
vLLM Backend#
High precision: Uses dedicated timing points from vLLM’s internal metrics
Queue separation: Provides a separate time_in_queue_in_ms metric
Generation time scope: Measures pure LLM processing time (excludes queue time)
TTFT reference: Calculated from when LLM processing starts (after the queue)
TensorRT-LLM Backend#
Estimation-based: Uses available timing points for calculation
End-to-end timing: Generation time includes total request time (queue + processing)
No queue separation: time_in_queue_in_ms is always null
TTFT reference: Calculated from request arrival time
Note
Due to different timing reference points, the same request may show different metric values between vLLM and TensorRT-LLM backends. vLLM provides pure LLM performance metrics, while TensorRT-LLM provides end-to-end latency metrics.
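One way to compare the backends on a like-for-like basis is to fold queue time back into the vLLM generation time before comparing end-to-end latency. The helper below is an illustration, not part of the NIM API; it only encodes the behavior described above.

```python
# Illustrative helper (not part of NIM): derive a comparable end-to-end latency.
# vLLM: generation_time_in_ms excludes queue time, so add time_in_queue_in_ms back.
# TensorRT-LLM: generation_time_in_ms already spans queue + processing and
# time_in_queue_in_ms is null, so use the value as-is.
def end_to_end_latency_ms(stats: dict) -> float:
    queue_ms = stats.get("time_in_queue_in_ms")
    if queue_ms is None:                               # TensorRT-LLM: already end-to-end
        return stats["generation_time_in_ms"]
    return stats["generation_time_in_ms"] + queue_ms   # vLLM: add queue time back
```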
Streaming vs Non-Streaming#
Calculation consistency: Metrics use the same formulas for both modes
Token counting: Streaming responses accumulate tokens progressively
Final values: Only the final (complete) metrics are meaningful for analysis (see the sketch after this list)
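A conservative way to collect streaming metrics is to parse the SSE stream and keep only the last stats object observed, since only the final values are meaningful. This sketch assumes stats ride along on streamed chunks the same way they appear on non-streaming responses; the exact chunk placement is not specified here.

```python
# Sketch: consume a streaming completion and keep only the last "stats" object seen.
# Assumes stats may appear on streamed chunks; later values supersede earlier ones.
import json
import requests

payload = {
    "model": "meta/llama-3.1-8b-instruct",   # assumed model name
    "prompt": "Hello, world",
    "max_tokens": 16,
    "stream": True,
}

final_stats = None
with requests.post("http://localhost:8000/v1/completions",
                   json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("stats") is not None:
            final_stats = chunk["stats"]     # keep the most recent stats object

print(final_stats)
```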
Known Issues#
TensorRT-LLM LLM API Non-Streaming Timing#
Warning
Known Issue: TensorRT-LLM LLM API non-streaming requests may have less accurate timing metrics compared to vLLM due to limited timing granularity. This affects the precision of generation_time_in_ms, time_to_first_token_in_ms, and derived metrics like tokens_per_second.
Workaround: For more accurate timing analysis with TensorRT-LLM, consider using streaming mode or vLLM backend when possible.
Use Cases#
Performance optimization: Identify bottlenecks in token generation
SLA monitoring: Track time to first token for user experience (see the sketch after this list)
Capacity planning: Analyze throughput patterns and queue behavior
A/B testing: Compare performance between different configurations
Debugging: Investigate latency issues with granular timing data
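For example, SLA monitoring for time to first token can be as simple as collecting time_to_first_token_in_ms across requests and comparing a percentile against a target. The 200 ms target and function names below are illustrative only.

```python
# Illustration: track time-to-first-token against an SLA target (threshold is arbitrary).
import statistics

SLA_TTFT_MS = 200.0   # example target, not a recommendation


def ttft_p95(stats_samples: list[dict]) -> float:
    """95th-percentile TTFT over a batch of collected stats objects."""
    ttfts = [s["response_tokens"]["time_to_first_token_in_ms"] for s in stats_samples]
    # quantiles() with n=20 returns 19 cut points; index 18 is the 95th percentile.
    return statistics.quantiles(ttfts, n=20)[18]


def meets_sla(stats_samples: list[dict]) -> bool:
    return ttft_p95(stats_samples) <= SLA_TTFT_MS
```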