Metrics
The Triton Inference Server provides Prometheus metrics indicating GPU and request statistics. By default, these metrics are available at http://localhost:8002/metrics. The metrics are only available by accessing the endpoint; they are not pushed or published to any remote server.
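Any HTTP client can scrape the endpoint. The following is a minimal sketch using only the Python standard library; it assumes a Triton server is already running on the local host with the default metrics port:

```python
# Minimal sketch: scrape the local Triton metrics endpoint and print the
# Prometheus-format text. Assumes Triton is running on this host with the
# default metrics port (8002).
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"

with urllib.request.urlopen(METRICS_URL, timeout=5) as response:
    metrics_text = response.read().decode("utf-8")

print(metrics_text)
```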
The Triton --allow-metrics=false option can be used to disable all metric reporting, and --allow-gpu-metrics=false can be used to disable just the GPU Utilization and GPU Memory metrics. The --metrics-port option can be used to select a different port.
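For example, the server could be launched with GPU metrics disabled and the metrics endpoint moved to a non-default port. The sketch below assumes the server binary is invoked as `tritonserver` and uses a placeholder model repository path:

```python
# Minimal sketch: launch Triton with GPU metrics disabled and the metrics
# endpoint on port 9002. Assumes `tritonserver` is on the PATH; the model
# repository path is a placeholder.
import subprocess

cmd = [
    "tritonserver",
    "--model-repository=/path/to/model_repository",  # placeholder path
    "--allow-gpu-metrics=false",  # keep request metrics, drop GPU metrics
    "--metrics-port=9002",        # expose metrics on a non-default port
]

server = subprocess.Popen(cmd)
```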
The following table describes the available metrics.
| Category | Metric | Description | Granularity | Frequency |
|---|---|---|---|---|
| GPU Utilization | Power Usage | GPU instantaneous power | Per GPU | Per second |
| GPU Utilization | Power Limit | Maximum GPU power limit | Per GPU | Per second |
| GPU Utilization | Energy Consumption | GPU energy consumption in joules since Triton started | Per GPU | Per second |
| GPU Utilization | GPU Utilization | GPU utilization rate (0.0 - 1.0) | Per GPU | Per second |
| GPU Memory | GPU Total Memory | Total GPU memory, in bytes | Per GPU | Per second |
| GPU Memory | GPU Used Memory | Used GPU memory, in bytes | Per GPU | Per second |
| Count | Request Count | Number of inference requests | Per model | Per request |
| Count | Execution Count | Number of inference executions (request count / execution count = average dynamic batch size) | Per model | Per request |
| Count | Inference Count | Number of inferences performed (one request counts as "batch size" inferences) | Per model | Per request |
| Latency | Request Time | End-to-end inference request handling time | Per model | Per request |
| Latency | Compute Time | Time a request spends executing the inference model (in the framework backend) | Per model | Per request |
| Latency | Queue Time | Time a request spends waiting in the queue | Per model | Per request |
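As noted in the table above, the Count metrics can be combined to derive useful quantities such as the average dynamic batch size (request count divided by execution count). The sketch below parses the Prometheus text from the endpoint and reports that ratio per model; the metric names nv_inference_request_success and nv_inference_exec_count are assumptions based on a typical Triton deployment and should be verified against the actual /metrics output of your server.

```python
# Minimal sketch: estimate average dynamic batch size per model as
# request count / execution count, from the Prometheus text endpoint.
# Metric names below are assumptions; check your server's /metrics output.
import re
import urllib.request
from collections import defaultdict

METRICS_URL = "http://localhost:8002/metrics"  # default endpoint

# Matches lines like: nv_inference_exec_count{model="simple",version="1"} 42
LINE = re.compile(r'^(\w+)\{.*?model="([^"]+)".*?\}\s+([0-9.eE+-]+)$')

requests_per_model = defaultdict(float)
execs_per_model = defaultdict(float)

with urllib.request.urlopen(METRICS_URL, timeout=5) as response:
    for raw in response.read().decode("utf-8").splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue
        name, model, value = match.group(1), match.group(2), float(match.group(3))
        if name == "nv_inference_request_success":
            requests_per_model[model] += value
        elif name == "nv_inference_exec_count":
            execs_per_model[model] += value

for model, execs in execs_per_model.items():
    if execs > 0:
        ratio = requests_per_model[model] / execs
        print(f"{model}: average dynamic batch size ~ {ratio:.2f}")
```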