Observability#

NIM provides Prometheus metrics that report GPU, process, and Python runtime statistics for the service. You can use these metrics to create Grafana dashboards. By default, the metrics are available at http://0.0.0.0:8000/v1/metrics.

Use the following command to retrieve the metrics:

curl -X 'GET' 'http://0.0.0.0:8000/v1/metrics'
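
Because the endpoint returns metrics in the Prometheus text exposition format, you can also filter for a single metric from the shell. The following is a minimal sketch, assuming the default port and using the gpu_utilization metric described in the table below:

curl -s 'http://0.0.0.0:8000/v1/metrics' | grep '^gpu_utilization'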

The following table describes the available metrics.

| Category | Metric Name                            | Description                                                       |
|----------|----------------------------------------|-------------------------------------------------------------------|
| GPU      | gpu_power_usage_watts                  | GPU instantaneous power, in watts                                 |
| GPU      | gpu_power_limit_watts                  | Maximum GPU power limit, in watts                                 |
| GPU      | gpu_total_energy_consumption_joules    | GPU total energy consumption, in joules                           |
| GPU      | gpu_utilization                        | GPU utilization rate (0.0–1.0)                                    |
| GPU      | gpu_memory_total_bytes                 | Total GPU memory, in bytes                                        |
| GPU      | gpu_memory_used_bytes                  | Used GPU memory, in bytes                                         |
| Process  | process_virtual_memory_bytes           | Virtual memory size, in bytes                                     |
| Process  | process_resident_memory_bytes          | Resident memory size, in bytes                                    |
| Process  | process_start_time_seconds             | Process start time, in seconds since the Unix epoch               |
| Process  | process_cpu_seconds_total              | Total user and system CPU time, in seconds                        |
| Process  | process_open_fds                       | Number of open file descriptors                                   |
| Process  | process_max_fds                        | Maximum number of open file descriptors                           |
| Python   | python_gc_objects_collected_total      | Objects collected during garbage collection                       |
| Python   | python_gc_objects_uncollectable_total  | Uncollectable objects found during garbage collection             |
| Python   | python_gc_collections_total            | Number of times each garbage-collection generation was collected  |
| Python   | python_info                            | Python platform information                                       |
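
To collect these metrics for Grafana dashboards, you typically point a Prometheus server at the endpoint. The following is a minimal sketch of a scrape configuration written from the shell; the file name, job name, scrape settings, and target address are placeholder assumptions that you should adapt to your deployment:

cat > prometheus-nim.yml <<'EOF'
# Minimal scrape job for the NIM metrics endpoint (placeholder values).
scrape_configs:
  - job_name: 'nim'
    metrics_path: '/v1/metrics'
    static_configs:
      - targets: ['localhost:8000']
EOF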

Triton Metrics#

NIM also exposes Triton Inference Server metrics that provide detailed information about model inference performance, request handling, and system utilization. By default, these Triton metrics are available at http://localhost:9002/metrics.

Use the following command to retrieve the Triton metrics:

curl localhost:9002/metrics

Note: When running NIM in a container, ensure that port 9002 is properly forwarded by including the -p 9002:9002 flag in your Docker run command.
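
For example, a container launch that forwards both the NIM API port and the Triton metrics port might look like the following sketch; the image reference and the remaining flags are placeholders that depend on your deployment:

# $NIM_IMAGE is a placeholder for the NIM container image you pulled.
docker run --gpus all -p 8000:8000 -p 9002:9002 $NIM_IMAGE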

The following are key Triton metrics; a command-line example follows the list:

  • Request metrics: Counts of successful and failed inference requests.

  • Inference metrics: Request queue times, compute times, and overall request durations.

  • Model metrics: Model loading times, execution counts, and batch statistics.

  • Memory metrics: GPU and CPU memory usage for inference operations.

  • Cache metrics: Response cache hit and miss rates (when caching is enabled).
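
For example, to spot-check the request counters from the shell, you can filter the endpoint output. This sketch assumes Triton's usual nv_inference_ metric prefix; the exact metric names can vary by Triton version and configuration:

curl -s localhost:9002/metrics | grep '^nv_inference_request'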

For comprehensive documentation on all available Triton metrics, their descriptions, and usage examples, see Metrics in the Triton Inference Server guide.