The TensorRT Inference Server (TRTIS) exposes Prometheus metrics that report GPU and request statistics. By default these metrics are available at http://localhost:8002/metrics; the TRTIS --metrics-port option can be used to select a different port. The following table describes the available metrics.
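Because the endpoint uses the standard Prometheus text exposition format, a Prometheus server can scrape it directly. A minimal sketch of a scrape configuration is shown below; the job name and 5-second interval are illustrative choices, and `localhost:8002` assumes TRTIS is running on the same host with the default metrics port.

```yaml
# Hypothetical Prometheus scrape config for a local TRTIS instance.
scrape_configs:
  - job_name: 'trtis'
    scrape_interval: 5s          # per-second metrics; scrape at least this often if needed
    static_configs:
      - targets: ['localhost:8002']   # host:port of the TRTIS metrics endpoint
```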

| Category | Metric | Description | Granularity | Frequency |
|----------|--------|-------------|-------------|-----------|
| GPU Utilization | Power Usage | GPU instantaneous power | Per GPU | Per second |
| GPU Utilization | Power Limit | Maximum GPU power limit | Per GPU | Per second |
| GPU Utilization | Energy Consumption | GPU energy consumption in joules since the server started | Per GPU | Per second |
| GPU Utilization | GPU Utilization | GPU utilization rate (0.0 - 1.0) | Per GPU | Per second |
| GPU Memory | GPU Total | Total GPU memory, in bytes | Per GPU | Per second |
| GPU Memory | GPU Used | Used GPU memory, in bytes | Per GPU | Per second |
| Count | Request Count | Number of inference requests | Per model | Per request |
| Count | Execution Count | Number of inference executions (request count / execution count = average dynamic batch size) | Per model | Per request |
| Count | Inference Count | Number of inferences performed (one request counts as "batch size" inferences) | Per model | Per request |
| Latency | Request Time | End-to-end inference request handling time | Per model | Per request |
| Latency | Compute Time | Time a request spends executing the inference model (in the framework backend) | Per model | Per request |
| Latency | Queue Time | Time a request spends waiting in the queue | Per model | Per request |
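The Count metrics can be combined as the table notes: request count divided by execution count gives the average dynamic batch size. The sketch below parses a scraped metrics payload and computes that ratio. The metric names (`nv_inference_request_success`, `nv_inference_exec_count`) and the sample values are assumptions for illustration; check the names your server actually exports at /metrics.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}.

    Labels are ignored and values for the same metric are summed; label
    values containing spaces are not handled (enough for a sketch).
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and HELP/TYPE comments
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]      # strip the {label="..."} suffix
        values[name] = values.get(name, 0.0) + float(value)
    return values

# Illustrative scrape output, not real server data; metric names are assumed.
sample = """\
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet50"} 120
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{model="resnet50"} 30
"""

metrics = parse_metrics(sample)
avg_batch = metrics["nv_inference_request_success"] / metrics["nv_inference_exec_count"]
print(avg_batch)  # 120 requests over 30 executions -> average batch size 4.0
```

Per the table, both counters are per-model, so in practice the division should be done per model label rather than over the summed totals.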