Observability#

NIM provides Prometheus metrics that report GPU, process, and Python runtime statistics for the service. You can use these metrics to create Grafana dashboards. By default, the metrics are available at http://0.0.0.0:8000/v1/metrics.

Use the following command to retrieve the metrics:

curl -X 'GET' 'http://0.0.0.0:8000/v1/metrics'
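
Because the endpoint returns metrics in the Prometheus text exposition format, you can also filter for a single metric from the shell. The following is a minimal sketch, assuming the default port and using the gpu_utilization metric described in the table below:

curl -s 'http://0.0.0.0:8000/v1/metrics' | grep '^gpu_utilization'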

The following table describes the available metrics.

| Category | Metric Name                            | Description                                                       |
|----------|----------------------------------------|-------------------------------------------------------------------|
| GPU      | gpu_power_usage_watts                  | GPU instantaneous power, in watts                                 |
| GPU      | gpu_power_limit_watts                  | Maximum GPU power limit, in watts                                 |
| GPU      | gpu_total_energy_consumption_joules    | GPU total energy consumption, in joules                           |
| GPU      | gpu_utilization                        | GPU utilization rate (0.0–1.0)                                    |
| GPU      | gpu_memory_total_bytes                 | Total GPU memory, in bytes                                        |
| GPU      | gpu_memory_used_bytes                  | Used GPU memory, in bytes                                         |
| Process  | process_virtual_memory_bytes           | Virtual memory size, in bytes                                     |
| Process  | process_resident_memory_bytes          | Resident memory size, in bytes                                    |
| Process  | process_start_time_seconds             | Process start time, in seconds since the Unix epoch               |
| Process  | process_cpu_seconds_total              | Total user and system CPU time, in seconds                        |
| Process  | process_open_fds                       | Number of open file descriptors                                   |
| Process  | process_max_fds                        | Maximum number of open file descriptors                           |
| Python   | python_gc_objects_collected_total      | Objects collected during garbage collection                       |
| Python   | python_gc_objects_uncollectable_total  | Uncollectable objects found during garbage collection             |
| Python   | python_gc_collections_total            | Number of times each garbage-collection generation was collected  |
| Python   | python_info                            | Python platform information                                       |
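
To collect these metrics for Grafana dashboards, you typically point a Prometheus server at the endpoint. The following is a minimal sketch of a scrape configuration written from the shell; the file name, job name, scrape settings, and target address are placeholder assumptions that you should adapt to your deployment:

cat > prometheus-nim.yml <<'EOF'
# Minimal scrape job for the NIM metrics endpoint (placeholder values).
scrape_configs:
  - job_name: 'nim'
    metrics_path: '/v1/metrics'
    static_configs:
      - targets: ['localhost:8000']
EOF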

Triton Metrics#

NIM also exposes Triton Inference Server metrics that provide detailed information about model inference performance, request handling, and system utilization. By default, these Triton metrics are available at http://localhost:9002/metrics.

Use the following command to retrieve the Triton metrics:

curl localhost:9002/metrics

Note: When running NIM in a container, ensure that port 9002 is properly forwarded by including the -p 9002:9002 flag in your Docker run command.
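
For example, a container launch that forwards both the NIM API port and the Triton metrics port might look like the following sketch; the image reference and the remaining flags are placeholders that depend on your deployment:

# $NIM_IMAGE is a placeholder for the NIM container image you pulled.
docker run --gpus all -p 8000:8000 -p 9002:9002 $NIM_IMAGE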

The following are key Triton metrics; a command-line example follows the list:

  • Request metrics: Counts of successful and failed inference requests.

  • Inference metrics: Request queue times, compute times, and overall request durations.

  • Model metrics: Model loading times, execution counts, and batch statistics.

  • Memory metrics: GPU and CPU memory usage for inference operations.

  • Cache metrics: Response cache hit and miss rates (when caching is enabled).
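
For example, to spot-check the request counters from the shell, you can filter the endpoint output. This sketch assumes Triton's usual nv_inference_ metric prefix; the exact metric names can vary by Triton version and configuration:

curl -s localhost:9002/metrics | grep '^nv_inference_request'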

For comprehensive documentation on all available Triton metrics, their descriptions, and usage examples, see Metrics in the Triton Inference Server guide.