Observability#
NIM provides Prometheus metrics indicating request statistics. You can use these metrics to create Grafana dashboards. By default, these metrics are available at http://0.0.0.0:8000/v1/metrics.
Use the following command to retrieve the metrics:
curl -X 'GET' 'http://0.0.0.0:8000/v1/metrics'
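If you prefer to poll the endpoint from a script rather than curl, the response uses the standard Prometheus text exposition format. The following is a minimal sketch, assuming the default endpoint above and that the requests package is installed; adjust the URL for your deployment.
```python
# Minimal sketch: fetch the NIM metrics endpoint from Python instead of curl.
# Assumes the default endpoint shown above and the `requests` package.
import requests

response = requests.get("http://0.0.0.0:8000/v1/metrics", timeout=5)
response.raise_for_status()
print(response.text)  # Prometheus text exposition format, one sample per line
```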
The following table describes the available metrics.
| Category | Metric Name | Description |
|---|---|---|
| GPU | gpu_power_usage_watts | GPU instantaneous power, in watts |
| | gpu_power_limit_watts | Maximum GPU power limit, in watts |
| | gpu_total_energy_consumption_joules | GPU total energy consumption, in joules |
| | gpu_utilization | GPU utilization rate (0.0–1.0) |
| | gpu_memory_total_bytes | Total GPU memory, in bytes |
| | gpu_memory_used_bytes | Used GPU memory, in bytes |
| Process | process_virtual_memory_bytes | Virtual memory size, in bytes |
| | process_resident_memory_bytes | Resident memory size, in bytes |
| | process_start_time_seconds | Start time of the process since the Unix epoch, in seconds |
| | process_cpu_seconds_total | Total user and system CPU time spent, in seconds |
| | process_open_fds | Number of open file descriptors |
| | process_max_fds | Maximum number of open file descriptors |
| Python | python_gc_objects_collected_total | Objects collected during GC |
| | python_gc_objects_uncollectable_total | Uncollectable objects found during GC |
| | python_gc_collections_total | Number of times this generation was collected |
| | python_info | Python platform information |
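As an example of consuming the metrics in the table, the following sketch parses the endpoint output and reports GPU utilization and memory usage. It assumes the default endpoint and that the requests and prometheus_client packages are installed; it is an illustration, not part of NIM.
```python
# Sketch: extract the GPU metrics listed above from the NIM metrics endpoint.
# Assumes the default endpoint and the `requests` and `prometheus_client` packages.
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://0.0.0.0:8000/v1/metrics", timeout=5).text

values = {}
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        values[sample.name] = sample.value  # keeps the last sample per metric name

utilization = values.get("gpu_utilization", 0.0)
used_bytes = values.get("gpu_memory_used_bytes", 0.0)
total_bytes = values.get("gpu_memory_total_bytes", 0.0)

print(f"GPU utilization: {utilization:.0%}")
if total_bytes:
    print(f"GPU memory: {used_bytes / 1e9:.1f} GB used of {total_bytes / 1e9:.1f} GB")
```
The sketch keeps only the last sample per metric name; on a multi-GPU system you would key the values by the sample labels instead.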
Triton Metrics#
NIM also exposes Triton Inference Server metrics that provide detailed information about model inference performance, request handling, and system utilization. By default, these Triton metrics are available at http://localhost:9002/metrics.
Use the following command to retrieve the Triton metrics:
curl localhost:9002/metrics
Note: When running NIM in a container, ensure that port 9002 is properly forwarded by including the -p 9002:9002 flag in your docker run command.
The following are key Triton metrics:
Request metrics: Counts of successful and failed inference requests.
Inference metrics: Request queue times, compute times, and overall request durations.
Model metrics: Model loading times, execution counts, and batch statistics.
Memory metrics: GPU and CPU memory usage for inference operations.
Cache metrics: Response cache hit and miss rates (when caching is enabled).
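As with the NIM metrics above, the Triton endpoint can be inspected programmatically. The following is a minimal sketch, assuming port 9002 is reachable and that the requests and prometheus_client packages are installed; it lists whichever metric families your deployment actually exposes.
```python
# Sketch: list the Triton metric families exposed on port 9002.
# Assumes the default Triton metrics port and the `requests` and
# `prometheus_client` packages; adjust the URL for your deployment.
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://localhost:9002/metrics", timeout=5).text

for family in text_string_to_metric_families(text):
    # Each family carries its name, type, and help text from the exposition format.
    print(f"{family.name} ({family.type}): {family.documentation}")
```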
For comprehensive documentation on all available Triton metrics, their descriptions, and usage examples, see Metrics in the Triton Inference Server guide.