Metrics

The Triton Inference Server provides Prometheus metrics indicating GPU and request statistics. By default, these metrics are available at http://localhost:8002/metrics. The metrics are available only by accessing the endpoint; they are not pushed or published to any remote server.

The Triton --allow-metrics=false option disables all metric reporting, and --allow-gpu-metrics=false disables only the GPU Utilization and GPU Memory metrics. The --metrics-port option selects a different port.
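
Because the metrics are pull-only, a client has to request the endpoint itself. The following is a minimal sketch, not an official client: it assumes the server runs locally on the default port and that the response uses the standard Prometheus text format. The helper names fetch_metrics_text and parse_metrics are hypothetical and chosen only for illustration.

```python
import re
import urllib.request

# Default endpoint from the text above; adjust the port if --metrics-port is set.
METRICS_URL = "http://localhost:8002/metrics"

# One sample line of the Prometheus text format: name{labels} value
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics_text(url=METRICS_URL, timeout=5.0):
    """Pull the raw metrics text from the endpoint (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and "# HELP" / "# TYPE" comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore anything that is not a sample line
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples
```

With a server running, parse_metrics(fetch_metrics_text()) yields one tuple per reported sample. If the server was started with --allow-metrics=false, the request is expected to fail rather than return an empty body, so callers may want to catch urllib.error.URLError.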

The following table describes the available metrics.

| Category | Metric | Description | Granularity | Frequency |
|---|---|---|---|---|
| GPU Utilization | Power Usage | GPU instantaneous power | Per GPU | Per second |
| GPU Utilization | Power Limit | Maximum GPU power limit | Per GPU | Per second |
| GPU Utilization | Energy Consumption | GPU energy consumption in joules since Triton started | Per GPU | Per second |
| GPU Utilization | GPU Utilization | GPU utilization rate (0.0 - 1.0) | Per GPU | Per second |
| GPU Memory | GPU Total Memory | Total GPU memory, in bytes | Per GPU | Per second |
| GPU Memory | GPU Used Memory | Used GPU memory, in bytes | Per GPU | Per second |
| Count | Request Count | Number of inference requests | Per model | Per request |
| Count | Execution Count | Number of inference executions (request count / execution count = average dynamic batch size) | Per model | Per request |
| Count | Inference Count | Number of inferences performed (one request counts as “batch size” inferences) | Per model | Per request |
| Latency | Request Time | Cumulative end-to-end inference request handling time | Per model | Per request |
| Latency | Queue Time | Cumulative time requests spend waiting in the scheduling queue | Per model | Per request |
| Latency | Compute Input Time | Cumulative time requests spend processing inference inputs (in the framework backend) | Per model | Per request |
| Latency | Compute Time | Cumulative time requests spend executing the inference model (in the framework backend) | Per model | Per request |
| Latency | Compute Output Time | Cumulative time requests spend processing inference outputs (in the framework backend) | Per model | Per request |
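
Because the count and latency metrics are cumulative, averages for a given interval come from the difference between two scrapes. The sketch below illustrates that arithmetic for a single model, including the request count / execution count ratio described in the table. The function name summarize and the dictionary keys are hypothetical stand-ins for the scraped values, and time values are left in whatever unit the server reports.

```python
def summarize(prev, curr):
    """Derive per-interval averages from two snapshots of one model's counters.

    Each snapshot is a dict with the hypothetical keys request_count,
    execution_count, inference_count, request_time and queue_time.
    """
    # Counters only grow, so the interval's activity is the difference.
    delta = {key: curr[key] - prev[key] for key in curr}
    requests = delta["request_count"]
    executions = delta["execution_count"]

    def ratio(numerator, denominator):
        # Return None instead of dividing by zero when nothing happened.
        return numerator / denominator if denominator else None

    return {
        # request count / execution count = average dynamic batch size
        "avg_dynamic_batch_size": ratio(requests, executions),
        # one request counts as "batch size" inferences
        "avg_inferences_per_request": ratio(delta["inference_count"], requests),
        "avg_request_time": ratio(delta["request_time"], requests),
        "avg_queue_time": ratio(delta["queue_time"], requests),
    }
```

For example, an interval with 100 requests served by 25 executions gives an average dynamic batch size of 4.0, and an idle interval gives None for every average.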