Metrics
The Triton Inference Server provides Prometheus metrics indicating GPU and request statistics. By default, these metrics are available at http://localhost:8002/metrics. The metrics are available only by accessing this endpoint; they are not pushed or published to any remote server.
The Triton --allow-metrics=false option can be used to disable all metric reporting and --allow-gpu-metrics=false can be used to disable just the GPU Utilization and GPU Memory metrics. The --metrics-port option can be used to select a different port.
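As a rough illustration of reading the endpoint, the sketch below fetches the metrics text and parses it into a dictionary. It assumes the server is running with metrics enabled at the default address, and that each sample line has the form `name{labels} value` with no trailing timestamp. The helper names `metrics_url`, `parse_metrics`, and `fetch_metrics` are hypothetical and not part of Triton.

```python
import urllib.request


def metrics_url(host="localhost", port=8002):
    """Build the metrics URL; the port matches the value given to --metrics-port."""
    return f"http://{host}:{port}/metrics"


def parse_metrics(text):
    """Parse Prometheus text-format samples into {"name{labels}": value}.

    Assumption: sample lines carry no trailing timestamp, so the last
    space-separated token on each line is the sample value.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


def fetch_metrics(url, timeout=5.0):
    """Pull the current metrics; the server never pushes them."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running server with metrics enabled.
    for name, value in sorted(fetch_metrics(metrics_url()).items()):
        print(name, value)
```

If the server was started with a different `--metrics-port`, pass that port to `metrics_url`. With `--allow-metrics=false` the request is expected to fail, because no metrics are reported.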
The following table describes the available metrics.
| Category | Metric | Description | Granularity | Frequency |
|---|---|---|---|---|
| GPU Utilization | Power Usage | GPU instantaneous power | Per GPU | Per second |
| GPU Utilization | Power Limit | Maximum GPU power limit | Per GPU | Per second |
| GPU Utilization | Energy Consumption | GPU energy consumption in joules since Triton started | Per GPU | Per second |
| GPU Utilization | GPU Utilization | GPU utilization rate (0.0 - 1.0) | Per GPU | Per second |
| GPU Memory | GPU Total Memory | Total GPU memory, in bytes | Per GPU | Per second |
| GPU Memory | GPU Used Memory | Used GPU memory, in bytes | Per GPU | Per second |
| Count | Request Count | Number of inference requests | Per model | Per request |
| Count | Execution Count | Number of inference executions (request count / execution count = average dynamic batch size) | Per model | Per request |
| Count | Inference Count | Number of inferences performed (one request counts as "batch size" inferences) | Per model | Per request |
| Latency | Request Time | Cumulative end-to-end inference request handling time | Per model | Per request |
| Latency | Queue Time | Cumulative time requests spend waiting in the scheduling queue | Per model | Per request |
| Latency | Compute Input Time | Cumulative time requests spend processing inference inputs (in the framework backend) | Per model | Per request |
| Latency | Compute Time | Cumulative time requests spend executing the inference model (in the framework backend) | Per model | Per request |
| Latency | Compute Output Time | Cumulative time requests spend processing inference outputs (in the framework backend) | Per model | Per request |
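
Because the count and latency metrics are cumulative, averages over a time window come from the difference between two readings. The sketch below is one possible way to derive them, assuming the values for a single model have already been extracted into plain dictionaries. The dictionary keys and the function names are hypothetical, and time values stay in whatever unit the server reports.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def interval_averages(prev, curr):
    """Derive per-interval averages from two snapshots of cumulative metrics.

    prev and curr are dicts for one model with the hypothetical keys
    request_count, execution_count, request_time, and queue_time.
    """
    delta = {key: curr[key] - prev[key] for key in curr}
    return {
        # Ratio given in the table: request count / execution count.
        "avg_dynamic_batch_size": ratio(delta["request_count"], delta["execution_count"]),
        # Cumulative time divided by the number of requests handled.
        "avg_request_time": ratio(delta["request_time"], delta["request_count"]),
        "avg_queue_time": ratio(delta["queue_time"], delta["request_count"]),
    }


if __name__ == "__main__":
    # Made-up readings, purely for illustration.
    earlier = {"request_count": 100, "execution_count": 50,
               "request_time": 1000.0, "queue_time": 200.0}
    later = {"request_count": 300, "execution_count": 100,
             "request_time": 5000.0, "queue_time": 600.0}
    print(interval_averages(earlier, later))
    # {'avg_dynamic_batch_size': 4.0, 'avg_request_time': 20.0, 'avg_queue_time': 2.0}
```

Returning `None` for an interval with no requests or executions avoids a division by zero and lets the caller tell "no traffic" apart from an average of zero.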