Metrics
The inference server provides Prometheus metrics indicating GPU and request statistics. By default, these metrics are available at http://localhost:8002/metrics. Use the `--metrics-port` option to expose them on a different port. The following table describes the available metrics, and a short example of querying them follows the table.
Category | Metric | Description | Granularity | Frequency |
---|---|---|---|---|
Utilization | Power Usage | GPU instantaneous power | Per GPU | Per second |
Utilization | Power Limit | Maximum GPU power limit | Per GPU | Per second |
Utilization | Energy Consumption | GPU energy consumption in joules since the server started | Per GPU | Per second |
Utilization | GPU Utilization | GPU utilization rate (0.0 - 1.0) | Per GPU | Per second |
Count | Request Count | Number of inference requests | Per model | Per request |
Count | Execution Count | Number of inference executions (request count / execution count = average dynamic batch size) | Per model | Per request |
Count | Inference Count | Number of inferences performed (one request counts as “batch size” inferences) | Per model | Per request |
Latency | Request Time | End-to-end inference request handling time | Per model | Per request |
Latency | Compute Time | Time a request spends executing the inference model (in the framework backend) | Per model | Per request |
Latency | Queue Time | Time a request spends waiting in the queue | Per model | Per request |
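
Because the endpoint serves the standard Prometheus text exposition format, it can be polled with any HTTP client. The sketch below, which assumes the default metrics port and the third-party `requests` package, fetches the endpoint once and prints the `# HELP` descriptions along with the sample lines:

```python
import requests

# Default metrics endpoint; adjust the port if the server was started
# with a different --metrics-port.
METRICS_URL = "http://localhost:8002/metrics"

response = requests.get(METRICS_URL, timeout=5.0)
response.raise_for_status()

# The response is Prometheus text format: '# HELP' and '# TYPE' comment
# lines describe each metric, followed by one sample line per label set
# (for example, one line per GPU or per model).
for line in response.text.splitlines():
    if line.startswith("# HELP") or not line.startswith("#"):
        print(line)
```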
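
The average dynamic batch size noted in the Execution Count row can be derived directly from scraped values. The following sketch uses the `prometheus_client` parser; the metric names `nv_inference_request_count` and `nv_inference_exec_count` are assumptions for illustration, so confirm the exact names against the `# HELP` lines in your server's `/metrics` output.

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

def family_total(families, name):
    """Sum a metric's samples across all label sets (e.g. across models)."""
    for family in families:
        if family.name == name:
            return sum(sample.value for sample in family.samples)
    return 0.0

text = requests.get("http://localhost:8002/metrics", timeout=5.0).text
families = list(text_string_to_metric_families(text))

# Metric names below are assumptions; check your server's /metrics output
# for the names used by your version.
request_count = family_total(families, "nv_inference_request_count")
execution_count = family_total(families, "nv_inference_exec_count")

# Per the table: request count / execution count = average dynamic batch size.
if execution_count:
    print(f"Average dynamic batch size: {request_count / execution_count:.2f}")
```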